“The conductor’s baton technique is the means by which he communicates with the players
and singers under his direction. It is not an end in itself and must become so automatic
that he can concentrate on musical interpretation, rather than on matters of technique.” 


—J. Lumley and N. Springthorpe, [JLNS1]. 


Abstract 


A global pandemic of rhythmic restlessness has struck regular concert-goers the world 
over. This illness typically manifests itself in a desire to conduct recordings of the Berlin 
Philharmonic, or some other renowned orchestra, from one's living room sofa. Many who 
fall foul of this disease are left with the bitter aftertaste of unrequited love, however, as the
lack of interactivity in most of today's music playback devices precludes any response 
from the music. 


This report details the development of an interactive, vision-based system that ad- 
dresses this problem, providing an application through which amateur conductors can im- 
prove their conducting technique. The focus is on determining the conductor's [potentially 
varying] tempo and playing back the music in time with it. Achieving this requires baton 
tracking, beat detection and beat prediction. 


A suitable tracking algorithm for this application must be able to track the baton in real 
time against a cluttered environment. We will explore the use of the condensation al- 
gorithm as a potential baton tracker. To evaluate its accuracy, we will compare its perform- 
ance to that of a shape recognition-based tracking algorithm that performs well in un- 
cluttered environments. We will also investigate methods for detecting beats after they 
have occurred and predicting the time at which the next beat will occur. 
Part I: INTRODUCTION


Of all the members of an orchestra, the conductor is the one person who is unable to 
work on his technique at any time he wishes; not even the most charismatic of conductors 
have whole orchestras available to them for practice at any time of the day or night. This can
be an even bigger problem for beginner conductors, who may not yet have an orchestra to
conduct at all.


Although basic baton motions can be practised alone (e.g. beating time in front of a mir- 
ror), doing so is of limited benefit, as a conductor's development needs to be guided by the 
invaluable feedback of an orchestra (an inexperienced conductor may be unable to judge 
how clear his gestures will appear to an orchestra). Having said that, high-quality private
practice and study are essential for all musicians, including the conductor. Certain tools,
such as the metronome and tuning forks, have gained wide acceptance as useful practice
aids. The development of a tool to improve the quality of a conductor's private technical
practice could prove to be similarly beneficial.


This report focuses on the development of such a tool. More specifically, it deals with 
the design and creation of a system that we have called a Conductible Virtual Orchestra: a
computer program that is able to “watch” how a conductor conducts a piece of music, interpret
his gestures as indications of the tempo he desires and then play back the music in
synchronisation with this tempo.


A number of other researchers have developed similar projects in the past, many of 
which required the use of expensive, unwieldy hardware. Some of these are discussed in 
section II.1. For such a system to have widespread appeal amongst amateur and student
conductors, however, a low-cost approach is vital. For this project we therefore focused on the
use of a webcam as the means by which the program can sense the conductor’s motion, as
webcams are generally very cheap (even for students!).


The idea of sensing the conductor’s motion through images captured by a camera falls 
into a rapidly growing area of Computer Science known as Computer Vision. This area 
deals with the formidable task of endowing computers with the ability to interpret images, 
an ability that comes so naturally to us humans that we take it for granted. Section II.3 of this
report presents some of the theoretical material from this area that is relevant to our pro- 
ject. The ways in which we used and further developed that material in order to realise the 
goals of this project are discussed in section III.2. 


Part II: BACKGROUND 


This part of the report introduces the background material needed to implement the 
system, starting with a summary of some related work. Where an in-depth report of a par- 
ticular subject area would go out of the scope of the report, references to more detailed 
sources are given. 


II.1 Related Work 


A variety of approaches to computer-based conducting systems have been taken by 
various researchers to date. The following gives a brief summary of some of these systems, 
and explains how their successes and shortcomings have influenced the approach that
we have decided upon for our system.


One of the earliest developed computer conducting systems is M. V. Mathews’ Radio 
Baton and Conductor Program, described in [MVM1]. This system is based on two batons 
with foam-covered radio transmitters attached to the ends. By measuring the strength of 
the radio signal at five receiving antennae, the system is able to track the position of the 
batons in 3D space. Baton movements cause MIDI triggers to be sent to a synthesiser, al- 
lowing the user to control the tempo, dynamics and other aspects of the predetermined 
composition that the synthesiser is playing. 


Numerous other systems have since been developed that use sensor-based hardware 
devices to track the conductor’s gestures, including [MLGGDW1] and [JBWSMM1], which 
use the Buchla Lightning infrared baton ([DB1]). The system described in [JBWSMM1] is
an interactive exhibit for a music exhibition, and so was targeted at amateur users who 
have no knowledge of professional conducting gestures. It is of note though because it 
transformed recorded audio and video data of a real orchestra playing in order to follow 
the user’s tempo rather than using synthesised sounds (the audio data was time-stretched 
using the Fourier transform and the video data was synchronised by skipping or repeating 
frames as necessary). 


Other sensor-based approaches are described in [TITT1] and [TMN1], which both use 
body-sensor-equipped jackets to measure the motion of the conductor’s limbs. In addition 
to this, the jacket used by T. M. Nakra in [TMN1] also measures muscular tension in the 
arms, heart rate, respiration, skin conductance and body temperature. Nakra’s analysis of 
the results revealed the importance of respiration and muscular force in communicating
musical phrasing and intensity/loudness. 


The advantage of using such sensor-based devices in a computer conducting system is 
that they are able to produce accurate 3D measurements quickly. Efficient computation is 
critical to such systems as they must operate in real-time, analysing streams of data 
sampled many times per second. One of the main disadvantages however is that in some 
cases they can restrict the freedom of the conductor. Concerning the use of a digital baton 
in an earlier project, Nakra says in [TMN1]: 


“The baton’s size and heaviness were not conducive to graceful, comfortable ges- 
tures; it was 5-10 times the weight of a normal cork-and-balsa wood conducting 
baton. A typical 45-minute gestural Brain Opera performance with the 10-ounce Di- 
gital Baton was often exhausting. This also meant that I couldn’t take it to orchestral 
conductors to try it out; it was too heavy for a conductor to use in place of a tradi- 
tional baton.” 


Another disadvantage is that the cost of developing such devices can potentially be
quite high; the target users of our system, beginner conductors, aren’t likely to be willing
to spend large amounts of money in order to use it.


For our application we decided to take a vision-based approach similar to D. Murphy’s 
in [DMTAKJ1], in which the motion of the conductor’s baton is tracked in the video output 
of a webcam. The advantage of developing this kind of system is that it wouldn’t interfere 
with the conductor’s movements, nor would it be beyond the price range of the average 
conducting student. The major disadvantage is the difficulty of designing an efficient and 
robust tracker. Murphy reports attaining a tracking rate of 25 frames/sec in [DM1] at a res- 
olution of 160x120 on a relatively low-spec machine, demonstrating that good results are 
attainable by this method. Another thing to note is that tracking would produce less accur- 
ate estimates of the baton’s position than some of the above-mentioned methods, which 
may make the gesture analysis stage of the application more difficult. 


Murphy’s system allows for two cameras to be used, one facing the conductor and one 
to his side. To reduce financial and computational costs however, we just used one camera. 
This circumvents the technical problem of synchronising and calibrating two cameras. 
Furthermore, by placing the camera to the side of the conductor rather than in front of him 
as in Murphy’s single-camera system, we reduce the variation in the length of the projec- 
tion of the baton onto the camera, and we eliminate the problem of the baton “disappear- 
ing” when it points straight at the camera. A profile-view camera placement also simplifies 
tracking, as the side-to-side motion of the baton is less pronounced when viewed from the 
side. This is a valid position for the camera to be in, as it has the same view of the conductor
that the 1st violin section or the ‘cello section of a symphony orchestra has.


II.2 The Art of Conducting 


This section gives a brief overview of conducting technique. Conducting is a highly expressive
art, requiring years of study and practice to master, and so a full treatment of the
subject would of course go outside the scope of this report. We will therefore only present
the aspects of conducting that are directly related to this project. The reader is referred to
[JLNS1], from which the source material for much of this section was taken, for a more detailed
guide to conducting.


II.2.1 The Conductor


[JLNS1] identifies the two basic requirements that a conductor must fulfil as “know[ing] 
the music in great detail and [having] the technique to communicate this knowledge”. The 
authors go on to say in chapter 3 that “communication is achieved by arm and wrist move- 
ments and by changes in posture and facial expression, the most important element being 
the movement of the right arm, which indicates the tempo and dynamics of the music.” It 
is thus the aspects of communication pertaining to the motion of the right arm that this
project is concerned with. In particular we will concentrate on the way a conductor
communicates the tempo of a piece to the orchestra.


II.2.2 Beating Time


Fig. II.2-1 — The preparatory beat.

The beat of a piece of music is a regular pulse that underlies the rhythmical variations of
each bar. As such it determines the tempo of the music, and so it is critical for each member
of the orchestra to be aware of this pulse in order for them to play together. There is an
ambiguity inherent in most pieces of music in that multiple pulses can often be felt, each
with a frequency that’s a multiple of the others. To resolve this, the conductor will usually
conduct the pulse that’s most comfortable for himself and the orchestra.


It is very important for the orchestra to know what tempo the conductor will start
conducting the piece of music with and precisely when he wants the piece of music to start.
To communicate this, conductors conduct a preparatory up-down beat one beat before the
orchestra should start (see fig. II.2-1). The upwards motion indicates the duration of half a
beat. From this the orchestra can predict when the downwards motion will end, which is the
point at which they should come in. The horizontal directions of the upwards and downwards
motions vary depending on the beat of the bar that the music starts on.


After the preparatory beat, the tempo is indicated to the orchestra by means of a periodic
right-arm gesture that indicates the point in time at which each beat, sub-beat or some
multiple of the beat occurs. Although each individual conductor may have personal styl- 
istic preferences as to the precise form of these gestures, there are universally accepted 
general forms that are common amongst most conductors. Figs. II.2-2 (a)-(c) illustrate the 
gestures for conducting 2, 3 or 4 beats per bar. The numbered crosses indicate the ictus, the 
point at which each beat occurs, the numbers indicate the beat and the circles indicate the 
half beats. Note that the ictus always occurs when the baton changes from a downwards 
motion to an upwards motion. It’s also important to note that the major downwards mo- 
tion always coincides with the first beat of the bar. 


Fig. II.2-2 — The patterns for conducting 2, 3 or 4 beats per bar, as seen from the conductor's point of view. 


(a) 2 beats (b) 3 beats (c) 4 beats 


Variations on these patterns can be used to communicate other things. For example the
smoothness of the path traced by the baton describes whether the music should be played
legato (smoothly) or staccato (detached). Fig. II.2-3 illustrates an example of conducting
staccato. The dynamics (loudness) of the music is indicated by the size of the pattern; small
gestures for quiet moments, larger ones for louder moments.

Fig. II.2-3 — Beating 3 beats per bar staccato.


II.3 Tracking


A vast array of techniques for tracking the motion of objects have been proposed over 
the years. The success of any given technique is often highly dependent on the problem at 
hand; a method that works well in one scenario won't necessarily work as well in another. 


The two common general approaches to tracking that are most appropriate to this pro- 
ject are blob tracking and contour tracking. The former involves segmenting the images of 
a sequence and determining which segments (blobs) correspond to the object of interest. 
The latter involves tracking the motion and, if appropriate, the deformation of a particular 
curve through a sequence of images. 


For our application we considered two different algorithms for tracking the motion of 
the conductor's baton. The first algorithm, a blob tracker, uses a statistical shape recogni- 
tion technique to identify which extracted segment in the image corresponds to the baton. 
The second algorithm is a contour tracker known as the condensation algorithm (see 
[MI1]). It works by propagating a conditional probability distribution that represents our 
beliefs about the state of the baton through time. 


Sections III.2.1 and III.2.2 of this report describe our implementation of these algorithms 
in detail. The next two sections present the elementary concepts of computer vision and 
statistics that the algorithms are based on. 


II.3.1 Image Processing 


When tracking an object or performing any other machine vision task, it is often neces- 
sary to process the input images in order to reduce noise and other artifacts, and to extract 
useful information that will be subjected to further analysis later on. This section gives a 
synopsis of the image processing operations that will be referred to later in this report. 


NOTE: throughout the following sub-sections, f(x) and f(i, j) are used to refer to the value
of pixels x and (i, j) respectively of an intensity image f. The techniques can generally be
applied to multi-channel images by applying them to each channel separately.


II.3.1.1 Convolution 


[MWWRI1] defines convolution as “an integral that expresses the amount of overlap of 
one function g as it is shifted over another function f. It therefore ‘blends’ one function 
with another”. It is the basis of a number of image filtering methods, and is formally 
defined as follows: 


$$(f * g)(x) = \int f(\tau)\, g(x - \tau)\, d\tau$$


Or in its discrete form it is:


$$(f * g)(x) = \sum_{n} f(n)\, g(x - n)$$


Where:


f * g is the convolution of function f with g. Function g is sometimes referred to as
the kernel function.
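
As a rough illustration of the discrete form, the following Python sketch convolves two finite sequences. The function name `convolve_1d` and the example values are hypothetical, and the sequences are assumed to be zero outside their stored range.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[x] = sum over n of f[n] * g[x - n].

    Both sequences are treated as zero outside their stored range.
    """
    out = [0.0] * (len(f) + len(g) - 1)
    for n, fn in enumerate(f):
        for m, gm in enumerate(g):
            # The term f[n] * g[m] contributes to output index x = n + m.
            out[n + m] += fn * gm
    return out


# A unit impulse convolved with a small smoothing kernel reproduces the kernel.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
kernel = [0.25, 0.5, 0.25]
blurred = convolve_1d(signal, kernel)
```

The impulse is "blended" with the kernel, which is the behaviour that the image filters in the following sections rely on.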


II.3.1.2 Noise Filtering 


Many different types of filters exist for reducing the amount of noise in an image. The 
one this report will refer to most is the Gaussian filter. It has the effect of blurring an im- 
age without introducing visual artifacts such as horizontal/vertical lines that certain other 
blurring methods create. 


In its true form, the Gaussian filter is defined as the convolution of a Bivariate Gaussian 
distribution with f, where the mean of the distribution is taken as the location of the de- 
sired output pixel. A commonly used approximation however is to define a (2R+1)x(2R+1) 
matrix G called a convolution filter that has the form of a truncated, discretised Gaussian 
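distribution. The discrete form of the convolution equation can then be used, with the
entries of G indexed from -R to R:

$$(f * G)(i, j) = \sum_{u=-R}^{R} \sum_{v=-R}^{R} f(i - u, j - v)\, G(u, v)$$

A typical example of G might then be:

$$G = \frac{1}{16} \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}, \text{ for } R = 1$$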

Another useful blurring operation is the median blur. This operation sets a pixel’s in- 
tensity to the median intensity of itself and its local neighbours. It has the property of pre- 
serving sharp edges rather than blurring them. 
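
The two blurring operations can be sketched in plain Python as follows. This is only an illustration: images are assumed to be lists of rows of intensities, borders are handled by clamping coordinates (one choice among several), and the kernel is the common 3x3 Gaussian approximation with weights 1-2-1 / 2-4-2 / 1-2-1 divided by 16. All function names are hypothetical.

```python
from statistics import median

# 3x3 Gaussian approximation; the weights sum to 16.
GAUSS_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def neighbourhood(img, i, j, r=1):
    """(2r+1)x(2r+1) window around pixel (i, j), clamping at the image borders."""
    h, w = len(img), len(img[0])
    return [[img[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
             for dj in range(-r, r + 1)]
            for di in range(-r, r + 1)]


def gaussian_blur(img):
    """Weighted average of each pixel's 3x3 neighbourhood (kernel is symmetric)."""
    out = []
    for i in range(len(img)):
        row = []
        for j in range(len(img[0])):
            win = neighbourhood(img, i, j)
            total = sum(GAUSS_3X3[a][b] * win[a][b]
                        for a in range(3) for b in range(3))
            row.append(total / 16)
        out.append(row)
    return out


def median_blur(img):
    """Replace each pixel by the median of its 3x3 neighbourhood."""
    return [[median(v for r in neighbourhood(img, i, j) for v in r)
             for j in range(len(img[0]))]
            for i in range(len(img))]
```

On an image containing a single bright noise pixel, `gaussian_blur` spreads the spike over its neighbours while `median_blur` removes it entirely; on a clean step edge, `median_blur` leaves the edge where it was.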


II.3.1.3 Edge Detection 


Knowing where the edges of objects in an image are is often useful for segmentation and
various other operations. Edges are defined as areas of an image where there is a sudden
change in intensity. We can use convolution filters to find such points, e.g. the Sobel
operator:

$$G_x' = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad G_y' = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$$

$$G_x(p) = (f * G_x')(p) \qquad G_y(p) = (f * G_y')(p)$$

$$|G(p)| = \sqrt{G_x(p)^2 + G_y(p)^2} \qquad \Theta(p) = \arctan\left(\frac{G_y(p)}{G_x(p)}\right)$$


Where:


|G(p)| is a measure of the strength of the edge at pixel p.


Θ(p) gives the orientation of the edge at pixel p.
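
A minimal sketch of the Sobel operator at a single interior pixel is given below. The two 3x3 masks are the standard Sobel masks (the sign convention for the vertical mask is an assumption), and they are applied without flipping, i.e. by correlation rather than true convolution, which affects only the sign of the responses and not the edge strength. The name `sobel_at` is hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to horizontal intensity change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to vertical intensity change


def sobel_at(img, i, j):
    """Return (strength, orientation in radians) at interior pixel (i, j)."""
    gx = sum(SOBEL_X[a][b] * img[i + a - 1][j + b - 1]
             for a in range(3) for b in range(3))
    gy = sum(SOBEL_Y[a][b] * img[i + a - 1][j + b - 1]
             for a in range(3) for b in range(3))
    # atan2 is used instead of a plain arctangent so that gx == 0 is handled.
    return math.hypot(gx, gy), math.atan2(gy, gx)
```

For an image whose left half is dark and right half is bright, the strength is zero inside the flat regions and large at the pixels next to the boundary.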


II.3.1.4 Segmentation 


Segmentation is the process of partitioning an image into distinct regions that correspond
to individual objects. Pixels in an image might be grouped into the same segment if
they:

* are of similar luminance (i.e. if they aren’t separated by an edge),

* are of similar colour,

* define an area of uniform texture,

* match the known hue and saturation histograms of the object(s) we’re interested
in,

* match the known texture of the object(s) we’re interested in.


One way of finding all the pixels in the same segment as a given pixel is to use a flood 
fill algorithm to iterate over its neighbours, using some of the above criteria (or some other 
criteria) to determine where the boundaries lie. 
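
The flood fill idea can be sketched as follows, using similarity of intensity to the seed pixel as the (assumed) grouping criterion and 4-connectivity between neighbours. The function name and the tolerance parameter `tol` are hypothetical choices for the illustration.

```python
from collections import deque


def flood_fill_segment(img, seed, tol):
    """Set of pixels 4-connected to `seed` whose intensity is within `tol` of the seed's."""
    h, w = len(img), len(img[0])
    target = img[seed[0]][seed[1]]
    segment = {seed}
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            inside = 0 <= ni < h and 0 <= nj < w
            if inside and (ni, nj) not in segment and abs(img[ni][nj] - target) <= tol:
                segment.add((ni, nj))
                queue.append((ni, nj))
    return segment
```

The returned set of coordinates is in the same form as the set P of pixel coordinates used by the shape recognition step in section II.3.2.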


II.3.1.5 Background Subtraction


The goal of background subtraction is to identify and mask out pixels in an image that 
are part of the background, so that pixels corresponding to the foreground objects can be 
identified. 


The simplest background subtraction method for image sequences with a static back- 
ground would be to define an image B of the background on its own, and then classify all 
pixels of f that are similar to the corresponding pixel in B as background pixels. Two pixels 
can be deemed to be similar if their absolute difference is less than some threshold. 


This method would almost certainly fail, however, as it wouldn't be robust to noise. An
improvement on this method is to calculate B as a running average of background images
over an arbitrary number of frames, so that B can be taken as the expected background image.
Pixels of f that are classified as background pixels can be used to update the corresponding
pixels of B; however, the potential for misclassification could lead to a degradation
in the accuracy of B.
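
One way the running-average scheme might look in code is sketched below. It assumes an exponentially weighted running average with a hypothetical blending factor `alpha`, which is only one of several ways of averaging over frames; the function name and parameters are likewise illustrative.

```python
def update_background(bg, frame, threshold, alpha=0.05):
    """Classify each pixel of `frame` and update the background model `bg` in place.

    A pixel is background if it differs from the model by less than `threshold`.
    Only background pixels are blended into the model. Returns a mask that is
    True where the pixel is classified as foreground.
    """
    mask = []
    for i, row in enumerate(frame):
        mask_row = []
        for j, value in enumerate(row):
            is_foreground = abs(value - bg[i][j]) >= threshold
            if not is_foreground:
                bg[i][j] = (1 - alpha) * bg[i][j] + alpha * value
            mask_row.append(is_foreground)
        mask.append(mask_row)
    return mask


background = [[100.0, 100.0]]
foreground_mask = update_background(background, [[102, 20]], threshold=10, alpha=0.5)
```

In this example the first pixel is absorbed into the background model while the second is flagged as foreground and leaves the model unchanged.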


The reader is referred to [AMM1] for a review of more background subtraction tech- 
niques that have been published in the literature, such as the adaptive mixture of Gaussi- 
ans method. 


II.3.2 Shape Recognition 


The techniques outlined in the previous section allow us to extract primitive features 
from an image such as the pixels that lie on an edge or segments that correspond to objects 
(or part of an object). We must process these features at a higher level in order to determ- 
ine more useful information about the image, such as whether or not the image contains 
an object of interest. 


Suppose we have extracted a set P of 2D coordinates that correspond to the pixels of an 
image segment. To determine whether or not P represents the shape we are interested in, 
we need to evaluate it with respect to a model of our prior knowledge of that shape. 


One common approach to this is to define a metric M(P) that represents how well P cor- 
responds to our model. Any segment P’ for which M(P’) has a low value can be discarded 
as an unlikely candidate. Depending on our requirements, we may then take all segments 
for which M(P) > some threshold T as the desired objects, or we may take the segment or 
segments that maximise M(P). 


Fig. II.3-1 — An image of a black baton.

Fig. II.3-1 shows the typical appearance of the baton. A metric that determines how similar
a given segment is to a baton should take into account the segment’s length, thinness and
straightness.

The length of a segment P is given by the Euclidean distance between the two pixels $p_0$
and $p_1$ of P that project onto the outermost extremities of P’s central axis. The standard
deviation of pixels about this axis can be taken as a measure of P’s thinness. By only
considering segments that are sufficiently thin as candidate batons, we implicitly force the
candidates to be straight.


Given the length $L_P$ of segment P and standard deviation $\sigma_P$ of its pixels about its
central axis, we can define M(P) as:


$$M(P) = \begin{cases} L_P = \lVert p_1 - p_0 \rVert, & \sigma_{min} \le \sigma_P \le \sigma_{max} \text{ and } L_{min} \le L_P \le L_{max} \\ 0, & \text{otherwise} \end{cases}$$


Where:


$\sigma_{min}$, $\sigma_{max}$, $L_{min}$, $L_{max}$ define the min/max permissible values of $\sigma_P$ and $L_P$.


Given $p_0$ and $p_1$, we can use the fact that the baton’s tip will usually be closer to the edge
of the frame that the conductor is facing to determine which ends of the baton $p_0$ and $p_1$
correspond to (the user can specify what direction the conductor is facing). If P’s central
axis is near-vertical, the point furthest from the horizontal line that runs through the
middle of the frame can be taken as the tip.


To measure Lp and op from P, we first of all need to determine P’s central axis. We can 
use regression to do this. 


II.3.2.1 Linear Regression


Given a set of data $P = \{(x_0, y_0), \ldots, (x_{N-1}, y_{N-1})\}$ (the (x, y) terms can in general be
scalars or vectors), regression in its most general form can be defined as the process of finding
an optimal parameterisation of a function f such that:

$$y_i = f(x_i, \Theta) + \epsilon_i, \quad 0 \le i < N$$

Where:

$\epsilon_0, \ldots, \epsilon_{N-1}$ are error terms to be minimised by the optimisation process.


$\Theta$ is the set of f’s parameters.


A commonly used optimisation scheme is the least squares method, in which we find $\Theta$
as follows:


$$\Theta = \arg\min_{\Theta'} \sum_{i=0}^{N-1} \epsilon_i^2 = \arg\min_{\Theta'} \sum_{i=0}^{N-1} \left( y_i - f(x_i, \Theta') \right)^2 \qquad \text{(Eq. II.3.2.1)}$$


To find the central axis of P, we require a linear function f(x) = a + bx, where $\Theta = (a, b)$.
Each $\epsilon_i$ is then equal to $y_i - f(x_i)$, the vertical distance from $(x_i, y_i)$ to $(x_i, f(x_i))$.
The closed-form solution to eq. II.3.2.1 when f is of this form can be found by calculating the
values of a and b for which $\frac{\partial}{\partial a} \sum_{i=0}^{N-1} \epsilon_i^2 = 0$ and
$\frac{\partial}{\partial b} \sum_{i=0}^{N-1} \epsilon_i^2 = 0$. The solution is given in [MWWR2].

However, f is undefined when the central axis is a vertical line. One solution to this
problem is to use polar coordinates.
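
For reference, the standard closed-form least-squares solution for a line y = a + bx can be sketched as below. The function name `fit_line` is hypothetical; the sketch also shows where the vertical-line case breaks down, since the denominator of the slope becomes zero.

```python
def fit_line(points):
    """Least-squares fit of y = a + b*x to a list of (x, y) pairs; returns (a, b)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        # All x values are equal: the axis is vertical and has no slope.
        raise ValueError("vertical axis cannot be written as y = a + b*x")
    b = (n * sxy - sx * sy) / denom
    a = (sy - b * sx) / n
    return a, b


intercept, slope = fit_line([(0, 1), (1, 3), (2, 5)])
```

Because a baton is frequently held close to vertical, this failure case is not a rare corner case in our setting, which motivates the polar formulation.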


II.3.2.2 Polar Coordinate Linear Regression 


To solve the regression problem using polar coordinates, we must express the problem
in vector form. We seek a vector L:


$$L(\mu; p, \theta) = p + \mu (\cos\theta, \sin\theta)$$


$$\text{s.t. } (x_i, y_i) = L(\mu_i; p, \theta) + \epsilon_i$$


$$\text{and } (p, \theta) = \arg\min_{(p', \theta')} \sum_{i=0}^{N-1} \lVert \epsilon_i \rVert^2 = \arg\min_{(p', \theta')} \sum_{i=0}^{N-1} \lVert (x_i, y_i) - L(\mu_i; p', \theta') \rVert^2 \qquad \text{(Eq. II.3.2.2)}$$


Where: $\mu_i = ((x_i, y_i) - p) \cdot (\cos\theta, \sin\theta)$ = projection of $(x_i, y_i) - p$ onto $(\cos\theta, \sin\theta)$.


Note: θ is measured anticlockwise from the x-axis (see Fig. A-1).

Each $\epsilon_i$ error term now defines the perpendicular vector from $(x_i, y_i)$ to L. The parameters
that we must find are p and θ. Theorem II.3.2.2-1 gives a closed-form solution to this
problem.

Calculating the parameters using theorem II.3.2.2-1 involves calculating a function
$R(\cos\theta, \sin\theta)$, which gives the sum of the squared perpendicular distances between
vector L and the elements of P. Hence, the standard deviation $\sigma_P$ of the elements of P about
central axis L is given by:

$$\sigma_P = \sqrt{\frac{R(\cos\theta, \sin\theta)}{N - 1}}$$

To find $p_0$ and $p_1$, the outermost extremities of segment P, we transform P’s elements in
such a way that their central axis lies on the x-axis, take the leftmost and rightmost
transformed elements, and then apply the inverse transformation to transform them back to the
original coordinate space. The transformation is simply a translation by -p followed by a
clockwise rotation by θ radians.
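Theorem II.3.2.2-1 — Closed-form solution to the polar coordinate linear regression problem.


Given a set $P = \{(x_0, y_0), \ldots, (x_{N-1}, y_{N-1})\}$, the line $L(\mu; p, \theta) = p + \mu(\cos\theta, \sin\theta)$ satisfies
eq. II.3.2.2, i.e. minimises the sum of the squared perpendicular distances between itself
and the elements of P, where the parameters are calculated as follows:


$$p = \frac{1}{N} \left( \sum_{i=0}^{N-1} x_i, \; \sum_{i=0}^{N-1} y_i \right)$$

$$(\cos\theta, \sin\theta) = \arg\min_{\omega \in \Omega} R(\omega)$$

$$\Omega = \{ (c, s), (c, -s), (s, c), (s, -c) \}$$

$$R : \Omega \to \mathbb{R}, \qquad R(\omega) = \sum_{i=0}^{N-1} \left[ v_i \cdot v_i - (v_i \cdot \omega)^2 \right], \qquad v_i = (v_{x,i}, v_{y,i}) = (x_i, y_i) - p$$

$$c = |\cos\phi| = \sqrt{\frac{|\cos 2\phi| + 1}{2}}$$

$$s = |\sin\phi| = \left| \sqrt{1 - c^2} \right|$$

$$|\cos 2\phi| = \left| \frac{A}{\sqrt{A^2 + B^2}} \right|$$

$$A = \frac{1}{2} \sum_{i=0}^{N-1} \left( v_{x,i}^2 - v_{y,i}^2 \right)$$

$$B = \sum_{i=0}^{N-1} v_{x,i}\, v_{y,i}$$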

The proof of this is given in Appendix A. Note that by expanding the expressions for A,
B, and R(...), it is possible to calculate p and $(\cos\theta, \sin\theta)$ in a single pass over the data.
See section III.2.1.4 for the details of this expansion.
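
The following Python sketch shows one way the perpendicular regression and the baton metric M(P) of section II.3.2 could fit together. It is an illustration under stated assumptions rather than the implementation described in section III.2.1: it makes several passes over the data instead of one, it evaluates R over the four candidate directions (c, s), (c, -s), (s, c) and (s, -c), it expects at least two points, and the names `fit_axis` and `baton_metric` are hypothetical.

```python
import math


def fit_axis(points):
    """Perpendicular least-squares axis of a list of (x, y) points.

    Returns (p, w, sigma): the centroid p, a unit direction w = (cos, sin) and the
    standard deviation of the points about the axis.
    """
    n = len(points)
    px = sum(x for x, _ in points) / n
    py = sum(y for _, y in points) / n
    vs = [(x - px, y - py) for x, y in points]
    a = 0.5 * sum(vx * vx - vy * vy for vx, vy in vs)
    b = sum(vx * vy for vx, vy in vs)
    norm = math.hypot(a, b)
    # If a and b are both zero, every direction fits equally well; pick one.
    cos2 = abs(a) / norm if norm > 0 else 1.0
    c = math.sqrt((cos2 + 1) / 2)
    s = math.sqrt(max(0.0, 1 - c * c))

    def r(w):
        # Sum of squared perpendicular distances to the line through p along w.
        return sum(vx * vx + vy * vy - (vx * w[0] + vy * w[1]) ** 2 for vx, vy in vs)

    w = min([(c, s), (c, -s), (s, c), (s, -c)], key=r)
    sigma = math.sqrt(max(0.0, r(w)) / (n - 1))
    return (px, py), w, sigma


def baton_metric(points, sigma_min, sigma_max, len_min, len_max):
    """M(P): the segment's length if it is thin and long enough, otherwise 0."""
    (px, py), (wx, wy), sigma = fit_axis(points)

    def project(pt):
        return (pt[0] - px) * wx + (pt[1] - py) * wy

    p0 = min(points, key=project)   # extremities along the central axis
    p1 = max(points, key=project)
    length = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    if sigma_min <= sigma <= sigma_max and len_min <= length <= len_max:
        return length
    return 0.0
```

A vertical column of pixels, which defeats the y = a + bx form, is handled without special cases here: `fit_axis` returns a direction of (0, 1) up to sign.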


II.3.3 The Condensation Algorithm


The condensation algorithm belongs to a class of algorithms known as particle filters. 
Particle filters provide a simulation-based approach to estimating the parameters of some 
model from observed data. In the case of this project, we are interested in a model of the 
motion of the baton. The parameters that we need to estimate define the baton’s state (i.e. 
its position, velocity and acceleration) at any given time. 


The formulation of the condensation algorithm that we used as a reference ([MI1]) was
designed for the near-real-time tracking of curves in a cluttered environment. Its main
advantage over other tracking algorithms (such as those based on the Kalman filter) is that it
is able to track objects through multi-modal environments, where there are multiple potential
candidates for the single object we're trying to track.


The following sections explain the theoretical basis of the algorithm. As in [MI1], we 
will begin by considering the problem of finding an object in a static image, after which we 
will show how the technique can be extended to tracking an object in a video sequence. 


II.3.3.1 Factored Sampling 


Suppose we have a set Z of data extracted from an image (e.g. by any of the means de- 
scribed in section II.3.1) and we want to use this data to determine the state X of an object 
in that image. Let P(X) be the prior probability of the object being in a particular state. By 
Bayes’ rule we have: 


$$P(X \mid Z) = k \cdot P(Z \mid X) \cdot P(X) \qquad \text{(Eq. II.3.3.1)}$$


Where: 


k is a normalising constant. 


The technique of factored sampling can be used to estimate P(X|Z), the posterior probability
of the object being in state X given Z. First of all we generate a sample set
$S = \{s_0, \ldots, s_{N-1}\}$ for some N from the prior distribution, P(X). We then define a random
variable X’ with the following probability distribution conditional on Z:


$$P(X' = s_i \mid Z) = \frac{P(Z \mid X = s_i)}{\sum_{j=0}^{N-1} P(Z \mid X = s_j)} \qquad \text{(Eq. II.3.3.2)}$$


$P(X' = s_i \mid Z)$ approximates P(X | Z) with increasing accuracy as N increases. The expected
value of some function f(X) can then be estimated as:


$$E[f(X) \mid Z] \approx E[f(X') \mid Z] = \sum_{i=0}^{N-1} f(s_i)\, P(X' = s_i \mid Z) \qquad \text{(Eq. II.3.3.3)}$$


Thus for f(X)=X, E[f(X’)|Z] approximates the mean state of the object, which we take as
its state in the image.
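
Factored sampling can be illustrated with a toy one-dimensional problem. In the sketch below the state is a single number, the prior is assumed to be uniform on [0, 10] and the observation likelihood is assumed to be a Gaussian bump centred on 7; these distributions, and the function name `factored_sampling`, are hypothetical and chosen only to make the example self-contained.

```python
import math
import random


def factored_sampling(sample_prior, likelihood, n, rng):
    """Draw n samples from the prior, weight each by the likelihood and normalise.

    Returns (samples, probs, mean), where mean estimates E[X | Z].
    """
    samples = [sample_prior(rng) for _ in range(n)]
    weights = [likelihood(s) for s in samples]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(s * p for s, p in zip(samples, probs))
    return samples, probs, mean


rng = random.Random(0)   # fixed seed so that the run is repeatable
samples, probs, mean = factored_sampling(
    sample_prior=lambda r: r.uniform(0.0, 10.0),
    likelihood=lambda s: math.exp(-0.5 * ((s - 7.0) / 0.5) ** 2),
    n=5000,
    rng=rng,
)
```

With these assumptions the estimated mean state lands close to 7, the centre of the likelihood, even though no sample was drawn from the posterior directly.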


II.3.3.2 Conditional Density Propagation 


The condensation algorithm extends the factored sampling technique in order to track
the motion of an object through a sequence of images by propagating the distribution of
the posterior P(X’|Z) through time. We denote the posterior for time t as $P(X_t \mid \mathcal{Z}_t)$,
where $X_t$ is the state of the object at time t, $\mathcal{Z}_t = \{Z_1, Z_2, \ldots, Z_t\}$ and each $Z_\tau$ denotes the
data extracted from the image that occurred in the sequence at time $\tau$.


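Our estimation of the posterior is based on the following equation:

$$P(X_t \mid \mathcal{Z}_t) = k_t\, P(Z_t \mid X_t)\, P(X_t \mid \mathcal{Z}_{t-1}) \qquad \text{(Eq. II.3.3.4)}$$


Where:


$k_t$ is a normalising constant.

The prior, $P(X_t \mid \mathcal{Z}_{t-1})$, is defined as:


$$P(X_t \mid \mathcal{Z}_{t-1}) = \int P(X_t \mid X_{t-1})\, P(X_{t-1} \mid \mathcal{Z}_{t-1})\, dX_{t-1} \qquad \text{(Eq. II.3.3.5)}$$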


We sample from the prior $P(X_t \mid \mathcal{Z}_{t-1})$ by generating a sample s’ from $P(X_{t-1} \mid \mathcal{Z}_{t-1})$,
the posterior distribution from time t-1, and then making a prediction conditional on s’
from the motion model $P(X_t \mid X_{t-1} = s')$. At time t=1 however, $\mathcal{Z}_{t-1} = \mathcal{Z}_0 = \emptyset$, and hence
$P(X_t \mid \mathcal{Z}_{t-1}) = P(X_1)$, so we simply sample directly from $P(X_1)$ in this case, which handles
the initialisation of the tracker.


The algorithm is given in alg. II.3.3.2-1 below. The reader is referred to [MI1] for proof
of the correctness of the above equations and the algorithm. It is based on the following
assumptions:


* Let $\mathcal{X}_t = \{X_1, X_2, \ldots, X_t\}$.


* The object’s state at time t is only dependent on its state at time t-1. This motion
model is expressed by the following first-order Markov chain:


$$P(X_t \mid \mathcal{X}_{t-1}) = P(X_t \mid X_{t-1})$$

* The observed image data at each time step t is independent of: the observed image
data at all other time steps; the motion model, i.e.:


$$P(\mathcal{Z}_{t-1}, X_t \mid \mathcal{X}_{t-1}) = P(X_t \mid \mathcal{X}_{t-1}) \prod_{i=1}^{t-1} P(Z_i \mid X_i)$$


Alg. II.3.3.2-1 — The Condensation Algorithm: this iteration of the algorithm propagates our estimate $S_{t-1}$ of the
posterior $P(X_{t-1} \mid \mathcal{Z}_{t-1})$ from time step t-1 to time t.


— From time step t-1 we have sample set $S_{t-1} = \{\langle s_i, P(X_{t-1} = s_i \mid \mathcal{Z}_{t-1}) \rangle,\ 0 \le i < N\}$


— We need to calculate $S_t$:
$S_t \leftarrow \emptyset$


For each i, $0 \le i < N$:
    Select an element $\langle s_j, p_j \rangle$ from $S_{t-1}$ with probability $p_j$.
    Predict a new sample $s_i'$ by sampling from $P(X_t \mid X_{t-1} = s_j)$.
    $p_i' \leftarrow P(Z_t \mid X_t = s_i')$.
    $S_t \leftarrow S_t \cup \{\langle s_i', p_i' \rangle\}$.


Normalise all of the $p_i'$ probabilities in $S_t$.
Output $E[X_t \mid \mathcal{Z}_t]$ as the mean state of the object at time t.
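
The select, predict and measure steps above can be sketched in Python for a toy problem. Everything specific in this sketch is an assumption made for illustration: the state is a single number rather than a baton state, the motion model is a Gaussian random walk, the likelihood is a Gaussian centred on a scalar measurement, and the names `condensation_step` and `track` are hypothetical. The initial uniform sample set plays the role of the estimate handed over from the previous time step.

```python
import math
import random


def condensation_step(samples, probs, sample_motion, likelihood, rng):
    """One iteration: select parents, predict new samples, weight them, normalise."""
    n = len(samples)
    parents = rng.choices(samples, weights=probs, k=n)       # select with probability p_j
    predicted = [sample_motion(s, rng) for s in parents]     # sample from the motion model
    weights = [likelihood(s) for s in predicted]             # P(Z_t | X_t = s')
    total = sum(weights)
    new_probs = [w / total for w in weights]
    mean = sum(s * p for s, p in zip(predicted, new_probs))  # estimate of E[X_t | Z_1..Z_t]
    return predicted, new_probs, mean


def track(measurements, n=500, seed=0):
    """Run the tracker over a sequence of scalar measurements; returns the mean states."""
    rng = random.Random(seed)
    samples = [rng.uniform(-5.0, 5.0) for _ in range(n)]     # broad initial sample set
    probs = [1.0 / n] * n
    means = []
    for z in measurements:
        samples, probs, mean = condensation_step(
            samples, probs,
            sample_motion=lambda s, r: s + r.gauss(0.0, 1.5),
            likelihood=lambda s, z=z: math.exp(-0.5 * ((s - z) / 0.5) ** 2),
            rng=rng,
        )
        means.append(mean)
    return means


means = track([float(t) for t in range(1, 11)])
```

In this run the object moves one unit per time step and the estimated mean follows it, lagging slightly because the random-walk motion model knows nothing about the object's velocity.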


II.3.3.3 Importance Sampling and (Re)initialisation


Importance sampling provides us with a way to reduce the variance of the samples in
the set $S_t$ generated by the conditional density propagation method. This has the effect of
reducing the size of the sample set needed to track an object with a given degree of accuracy,
thus improving the tracker’s efficiency.


The idea is to introduce a distribution 9(X;) (sometimes called an “importance 
function”) that is biased towards the most likely values of X;. Rather than drawing 
samples from the posterior from the previous iteration and making a prediction from 
these using our motion model, we draw samples from 9(X;) and then weight the likelihood 
so as to make our estimate of the posterior unbiased. Thus after drawing a sample s’ from 
9(X,), we add the element <s’, w’-P(Z,| X;=s’)> to S,, where w’ is calculated as follows: 


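$$w' = \frac{\sum_{j=0}^{N-1} P(X_{t-1} = s_j \mid Z_{t-1}) \cdot P(X_t = s' \mid X_{t-1} = s_j)}{g(X_t = s')} \qquad \text{(Eq. II.3.3.6)}$$


Where:
$\langle s_j, P(X_{t-1} = s_j \mid Z_{t-1}) \rangle \in S_{t-1}$, 0 ≤ j < N.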


Notice that the numerator in this equation is a discretisation of eq. II.3.3.5. Thus $w' \cdot P(Z_t \mid X_t = s') \approx P(X_t = s' \mid Z_{t-1}) \cdot P(Z_t \mid X_t = s') \cdot (g(X_t = s'))^{-1}$. This gives us an unbiased estimate of the posterior $P(X_t = s' \mid Z_t)$, as the probability of drawing sample s’ is $g(X_t = s')$ (see [WP1] for more details).


To (re)initialise the system we simply take $g(X_t)$ as the prior distribution and sample from it. Since (re)initialisation ignores the previous state of the system, the original factored sampling posterior estimator (eq. II.3.3.2) should be used to calculate the posterior in this case.
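

The correction factor w’ of eq. II.3.3.6 is a sum over the previous sample set divided by the importance density at the new sample. The sketch below assumes a one-dimensional state; `transition_pdf` and `g_pdf` are hypothetical callables standing in for $P(X_t \mid X_{t-1})$ and $g(X_t)$, not functions from this project.

```python
import math


def importance_weight(s_new, prev_samples, prev_weights, transition_pdf, g_pdf):
    """Return w' for a sample s_new that was drawn from g rather than from the motion model."""
    # Numerator: sum_j P(X_{t-1} = s_j | Z_{t-1}) * P(X_t = s' | X_{t-1} = s_j).
    numerator = sum(p * transition_pdf(s_new, s)
                    for s, p in zip(prev_samples, prev_weights))
    # Denominator: the density g(X_t = s') with which s' was drawn.
    return numerator / g_pdf(s_new)


def unit_gaussian_transition(x_new, x_old):
    """Hypothetical motion model: X_t ~ N(X_{t-1}, 1)."""
    return math.exp(-0.5 * (x_new - x_old) ** 2) / math.sqrt(2.0 * math.pi)


def uniform_g(x):
    """Hypothetical importance function: uniform on [-5, 5]."""
    return 0.1 if -5.0 <= x <= 5.0 else 0.0
```

The element added to the sample set is then the pair (s’, w’ multiplied by the observation likelihood of s’), followed by the usual normalisation.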


II.4 Sampling Probability Distributions 


Stochastic tracking algorithms, such as the condensation algorithm, rely heavily on the 
idea of randomly sampling probability distributions. The following sections explain how 
we can sample from the three main types of probability distribution that are relevant to 
this project: uniform distributions, empirical distributions and multivariate Gaussian dis- 
tributions. 


II.4.1 Sampling Uniform Distributions 


The uniform distribution forms the basis of most other probability distribution 
sampling algorithms, including those mentioned above. Hence if our uniform distribution 
sampling algorithm doesn’t generate samples that are in some sense representative of a 
uniform distribution, we will not be able to use those samples to generate samples that are 
representative of any other distribution. Thus before adopting a particular PRNG (Pseudo 
Random Number Generator), it is important to consider whether or not the statistical 
properties of its output will be satisfactory for the task at hand. 


One method for testing how uniformly distributed the output of a PRNG is, is the k-distribution test. [AKCDL1] defines a sequence of random variables $X_0, X_1, X_2, \ldots \in [0, 1)$ as being k-distributed if:


$$\forall n,\ A \in [0,1)^k,\ B \in [0,1)^k:\quad P(A_0 \le X_n < B_0, \ldots, A_{k-1} \le X_{n+k-1} < B_{k-1}) = \prod_{i=0}^{k-1} (B_i - A_i)$$


Where:


$A_i$, $B_i$ are the i-th components of the k-dimensional vectors A and B.


In practice no PRNG can be k-distributed for all A, B as required by the definition due to computational limitations. To account for this, the hypercube domain of A and B is usually quantised as follows:


$$A, B \in Q = \{\, i / 2^v ;\ 0 \le i < 2^v \,\}^k, \quad v \in \mathbb{N}$$


A PRNG that is k-distributed for all A and B in Q is said to be k-distributed to v-bit accuracy. Clearly the greater the degree of this accuracy, the more uniformly distributed a PRNG’s output is.


Another important test to consider is the lag-k autocorrelation. This measures how inde- 
pendent the samples generated by a PRNG are. It is defined as: 
It can be shown that the lag-k autocorrelation of a uniform distribution is 0 for all k. The
independence test that arises from this measurement involves determining how close to 0 
the lag-k autocorrelation is for a particular PRNG over a range of values of k. 


One PRNG that scores well in both of these tests is the Mersenne Twister generator. [MMTN1] reports that it is 623-distributed to 32-bit accuracy, which is considerably higher than most other PRNGs. Tests performed by [AKCDL1] show that it passes the lag-k autocorrelation test for most k between 1 and 10. Furthermore it has a colossal period of $2^{19937}-1$ (hence the name “Mersenne”) and it is highly efficient, as it only uses bitwise operations and additions. All of these factors make it a suitable PRNG for this project. A more detailed discussion of the algorithm along with source code can be found in [MMTN1].
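

As an informal illustration of the independence test, the sketch below estimates the lag-k autocorrelation of a sample sequence. It assumes the usual sample estimator, the sum of $(x_i - \bar{x})(x_{i+k} - \bar{x})$ divided by the sum of $(x_i - \bar{x})^2$, which may differ in normalisation from the definition used by [AKCDL1]. Python's standard `random` module is used because its core generator is the Mersenne Twister; the seed and sample count are arbitrary choices.

```python
import random


def lag_autocorrelation(xs, k):
    """Sample estimate of the lag-k autocorrelation of the sequence xs."""
    n = len(xs)
    mean = sum(xs) / n
    variance_sum = sum((x - mean) ** 2 for x in xs)
    covariance_sum = sum((xs[i] - mean) * (xs[i + k] - mean)
                         for i in range(n - k))
    return covariance_sum / variance_sum


# random.Random wraps a Mersenne Twister generator.
generator = random.Random(12345)
uniform_samples = [generator.random() for _ in range(20000)]
autocorrelations = [lag_autocorrelation(uniform_samples, k) for k in range(1, 11)]
```

For a well-behaved generator every entry of `autocorrelations` should lie close to 0, whereas a strongly ordered sequence such as 0, 1, 2, ... gives values close to 1.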


II.4.2 Sampling Empirical Distributions 


An empirical distribution for a random variable X is a discrete distribution of finite 
range R, for which the probability density function f(X) has been estimated by experiment. 
We can sample from such distributions by using the inverse transform method. 


The inverse transform method is based on the fact that for any cumulative distribution function F(X), $P(F(X) \le u) = u$ for all $0 \le u \le 1$ $\Rightarrow$ $F(X) \sim U(0, 1)$ $\Rightarrow$ $X \sim F^{-1}(U(0, 1))$. But $X \sim f(X)$ $\Rightarrow$ $F^{-1}(U(0, 1)) \sim f(X)$. Therefore, we can generate samples from any distribution whose cumulative distribution function is known and can be inverted by first of all generating samples from U(0, 1) and then transforming them by the inverse of the CDF. The proof of this result is given in [WP2].


So given an empirical distribution f(X) where X is discrete and ranges over $x_0, x_1, \ldots, x_{R-1}$, we can define the CDF F(X) as $Y_i = F(X \le x_i) = \sum_{j=0}^{i} f(x_j)$, for all i, 0 ≤ i < R. These $Y_i$'s can be calculated and stored in advance. The inverse function $F^{-1}(Y)$ is then given by a “reverse lookup” approach, i.e.:


$$F^{-1}(Y) = x_j \ \text{such that}\ (j = 0 \wedge Y \le Y_0) \vee (0 < j < R \wedge Y_{j-1} < Y \le Y_j)$$


This inverse function can be calculated efficiently (in $O(\log_2 R)$ time) using binary search.
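

A minimal sketch of this reverse lookup is given below, using the standard library's `bisect` module for the binary search. The function names and the example distribution are illustrative assumptions.

```python
import bisect
import itertools
import random


def build_cdf(probabilities):
    """Precompute Y_i = f(x_0) + ... + f(x_i) for every i."""
    return list(itertools.accumulate(probabilities))


def sample_empirical(values, cdf, rng=random):
    """Draw one value by inverting the stored CDF with a binary search."""
    # Scale by cdf[-1] so that small rounding errors in the total are harmless.
    u = rng.random() * cdf[-1]
    # bisect_left returns the smallest j with cdf[j] >= u, i.e. Y_{j-1} < u <= Y_j.
    j = bisect.bisect_left(cdf, u)
    return values[j]


# Hypothetical three-valued empirical distribution.
example_values = ["a", "b", "c"]
example_cdf = build_cdf([0.2, 0.5, 0.3])
```

Each call to `sample_empirical` costs one uniform sample plus a binary search over the R stored values.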
II.4.3 Sampling Gaussian Distributions


The CDF of a Gaussian distribution cannot be expressed in closed form, and so we cannot sample from a Gaussian distribution using the inversion method. There is however another approach known as the Box-Muller transform, which generates a pair of independent, standard normally-distributed random variables $X_0$ and $X_1$ from a pair of independent, uniformly-distributed r.v.s $U_0$ and $U_1$. [JG1] gives proof of the method, and so we will only state the results here.


The method can be expressed as a pair of closed-form trigonometric equations and as a 
rejection-sampling based algorithm that uses polar-coordinates (see alg. II.4.3-1). The latter 
formulation is generally said to be more efficient than the former despite the uncertainty 
in its termination condition. 


Alg. II.4.3-1 — A rejection sampling-based approach to the Box-Muller transform.
— Given a pair of independent r.v.s $U_0, U_1 \sim U(-1, 1)$, this algorithm generates a sample from a
— pair of independent r.v.s $X_0, X_1 \sim N(0, 1)$.


$r \leftarrow 0$


do:
    $(u_0, u_1) \leftarrow$ Sample from $(U_0, U_1)$
    $r \leftarrow u_0^2 + u_1^2$

while: $r = 0 \vee r > 1$


Output $(X_0, X_1) = \left(u_0 \sqrt{-2 \ln r / r},\ u_1 \sqrt{-2 \ln r / r}\right)$


The termination condition depends on the generation of a random point $(u_0, u_1) \in [-1, 1]^2$ that falls within the unit circle. Consequently, the number of times the termination condition fails before the algorithm terminates follows a geometric distribution with success parameter $0.25\pi$. From this we can deduce that the average number of iterations is $1 + (1 - 0.25\pi) \cdot (0.25\pi)^{-1} \approx 1.27$.


Given a sample x generated by this algorithm, we can transform it into a sample x’ from the distribution $N(\mu, \sigma^2)$ using the equation $x' = x \cdot \sigma + \mu$.
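

The polar form of the transform and the final scaling step can be sketched as follows. The function names are illustrative, and the acceptance test mirrors the loop condition of alg. II.4.3-1.

```python
import math
import random


def polar_box_muller(rng=random):
    """Return a pair of independent N(0, 1) samples using the polar method."""
    while True:
        u0 = rng.uniform(-1.0, 1.0)
        u1 = rng.uniform(-1.0, 1.0)
        r = u0 * u0 + u1 * u1
        # Accept only points inside the unit circle, excluding the origin.
        if 0.0 < r <= 1.0:
            break
    scale = math.sqrt(-2.0 * math.log(r) / r)
    return u0 * scale, u1 * scale


def sample_gaussian(mu, sigma, rng=random):
    """Scale and shift a standard normal sample: x' = x * sigma + mu."""
    x, _ = polar_box_muller(rng)
    return x * sigma + mu
```

This sketch discards the second sample of each pair in `sample_gaussian` for brevity; caching it for the next call would halve the number of uniform samples needed.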


II.4.3.1 Multivariate Gaussian Distributions


The multivariate Gaussian distribution is a generalisation of the Gaussian distribution to more than one dimension. A random n-vector $X = (X_0, X_1, \ldots, X_{n-1})^T$ that follows an n-dimensional multivariate Gaussian is parameterised by its mean n-vector $\mu = (\mu_0, \mu_1, \ldots, \mu_{n-1})^T$ and its n×n covariance matrix $\Sigma = E[(X-\mu) \cdot (X-\mu)^T]$.


In the special case where the off-diagonal covariance values in $\Sigma$ are all zero, the elements of X are all mutually independent, and so we can generate a sample from $f(X; \mu, \Sigma)$ by generating n independent samples from $X_i \sim N(\mu_i, \Sigma_{i,i})$ for each $X_i$ using the Box-Muller transform. In the more general case where the off-diagonal covariance values are non-zero, we first of all use the following fact to sample the multivariate Gaussian n-vector X’ = X-μ with covariance matrix $\Sigma$ and zero mean:


If column vector Y follows an n-dimensional multivariate Gaussian distribution with zero mean and covariance matrix I (the n×n identity matrix), and X’ = AY for some n×n matrix A, then:


$$\Sigma = E[X' \cdot X'^T] = E[AY \cdot (AY)^T] = A \cdot E[Y \cdot Y^T] \cdot A^T = A \cdot I \cdot A^T = A \cdot A^T$$


So if we can find a matrix A such that $\Sigma = A \cdot A^T$, then we can generate a sample x’ from X’ by first generating a sample y’ from Y (whose off-diagonal covariance values are all zero) and then setting x’ = Ay’. Finally, a sample x from X is given by x = x’ + μ.


Matrix A can be generated by the Cholesky decomposition. A full description of this method can be found in [WP3], and an $O(n^3)$ algorithm to calculate it can be found in [NRC1].
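

The whole procedure can be sketched in plain Python as below. This is a textbook Cholesky factorisation written for illustration, not the routine from [NRC1], and it assumes that the covariance matrix is symmetric and positive definite. The standard library's `gauss` method stands in for the Box-Muller sampler.

```python
import math
import random


def cholesky(sigma):
    """Return lower-triangular A with A * A^T == sigma (sigma must be SPD)."""
    n = len(sigma)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                a[i][j] = (sigma[i][j] - partial) / a[j][j]
    return a


def sample_multivariate_gaussian(mu, a, rng=random):
    """Draw x = A y + mu, where y has independent N(0, 1) components."""
    y = [rng.gauss(0.0, 1.0) for _ in mu]
    # A is lower triangular, so row i only involves y[0..i].
    return [m + sum(a[i][k] * y[k] for k in range(i + 1))
            for i, m in enumerate(mu)]
```

The factor A only needs to be computed once per covariance matrix; each subsequent sample costs n standard normal draws and one triangular matrix-vector product.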


Part III: IMPLEMENTATION


The system we have implemented presents a vision-based solution to the problem of re- 
sponding to a conductor’s gestures in real time as he conducts a piece of music. The sys- 
tem is able to interpret the motion of the conductor’s baton as an indication of the tempo 
of the piece of music he is conducting, in response to which it can play back the piece of 
music (in MIDI form) in synchronisation with him. 


The sub-sections of this part contain a detailed discussion of the approach we took in 
implementing the various components of the system. We will consider how we used the 
background material presented in the previous part, as well as the ways in which we ex- 
panded upon that material in order to achieve the desired results. Before this however, we 
present an overview of the system’s design and the function served by its main compon- 
ents. 


III.1 System Design 


When implementing any large system, it is easy to fall into the trap of over-complicat- 
ing its design to the point where its structure becomes incomprehensibly convoluted and 
impossible to maintain. This tends to be one of the major causes of project failure, and so 
one of the most critical steps in the development of our system was the creation of a clear, 
concise design that would allow us to concentrate on achieving the main goals of our pro- 
ject without having to spend too much time trying to correct or work around any limita- 
tions caused by structural deficiencies. 


III.1.1 Problem Decomposition 


The first step in designing the system was to decompose the problem we wanted to 
solve into a small set of well-defined sub-problems. We identified the following as being 
central to our solution: 


• Video streaming: the main input to the system is a video stream of the conductor conducting. The system should be able to accept video streams both from video files (for testing purposes) and from cameras connected to the computer (for interactive performance).

• Baton tracking: the system must be able to track and output the position of the baton’s tip and base as observed in every frame of the video stream. Depending on the tracking algorithm used, the system may also need to be able to determine additional derivatives of motion such as the velocity and acceleration vectors of the tip and base of the baton.

  The tracking component must be:

    ◦ Robust: it needs to be able to continuously track the motion of the baton for the duration of the piece of music. Realistically speaking, as far as the most likely usage of the system is concerned, this may mean tracking for 3-8 minutes.

    ◦ Insensitive to noise: background noise can easily distract a tracker, which in turn can mislead other components that are directly or indirectly dependent on the tracker’s output. The tracker should be able to cope with “reasonable” amounts of noise, detect when it has lost track of the baton and reinitialise itself accordingly.

    ◦ Efficient: section III.3.1 discusses the minimum frame rate needed for the system to be able to follow the conductor at a given tempo. However as a general rule, a higher tracking frame rate allows the beat detection component (see next bullet point) to analyse the baton’s trajectory more accurately and to give the times at which it detects beats to a greater degree of precision. Furthermore an efficient tracking algorithm leaves more CPU time for the other components, allowing them to perform more intense computation if necessary.

• Beat detection: the trajectory of the baton’s tip and base must be analysed in order for the system to be able to determine the time at which each half beat occurs. Depending on the quality of the tracker’s output, this component may need to filter the tracked baton trajectory further, so as to avoid false positives (i.e. the mistaken detection of a non-existent half beat).

• Beat prediction: the beat detection component can only ever recognise the time at which beats occur after they have occurred. By this time it is too late to play the corresponding section of music. To alleviate this problem, the system must be able to make predictions about when future beats will occur.

• Music playback: the system must be able to play back the piece of music that the conductor is conducting, varying the tempo in accordance with the half beat times predicted by the beat prediction component.

• File output: the various system components should be able to write their output to disk for further analysis at a later time.
There is a clear flow of information from one component to the next in this list, which
highlights some of the types of inter-component interaction that our design must take into 
account. In the visualisation of the information flow given in fig. III.1-1, components are 
grouped together into a single thread when there is a strictly sequential flow of informa- 
tion between them. Other components, such as the baton tracker and the video stream, 
need to execute concurrently however, so they are put into separate threads. 


Fig. III.1-1 - Information flow diagram. The curved rectangles represent threads running concurrently with one another, whilst the darker boxes within them represent some of the individual components. The arrows show the flow of data in-between threads and components. Points where an arrow forks represent information flowing to multiple destinations.

[Diagram not reproduced. Labelled elements: Video Stream, Video Controller Thread, Baton Tracker, Beat Detector, Beat Predictor, GUI Thread, File Output Thread, MIDI, Sound Card.]


Extra components have been added to show the flow of information to the GUI-updat- 
ing thread and the file-writing thread (which is used to record system data for later ana- 
lysis). Notice also that the flow of information between the baton tracker, beat detector and 
beat predictor forms a loop. The arrow running from the beat predictor to the baton track- 
er has been added to model the fact that the tracker’s behaviour may depend on its belief 
of when the next beat will occur. 




III.1.2 System Components


Having now discussed the overall system design, we can consider the structure of the 
components in more detail. The key word that summarises our approach to this aspect of 
the design is “abstraction”. Carefully defining interfaces and abstract classes that encapsulate the desired behaviour of the individual parts whilst hiding implementation-specific
details gave us leeway to experiment with multiple implementations of the various com- 
ponents. The following sub-sections highlight a few of the more interesting aspects of the 
component design. 


III.1.2.1 The Beat Detector Interface 


The beat detector component periodically sends information to three other components. 
It would be a waste of CPU time for these three components to poll the beat detector to 
check whether or not a beat has been detected. Worse still, the GUI and data-logging com- 
ponents may miss some of the detected beats by this method as they run in separate 
threads to the beat detector. 


To avoid these problems we used the publish/subscribe design pattern. In this pattern, 
the beat detector is viewed as a publisher (of beat times) and the three components that 
need to receive notifications of each beat time are subscribers. These three components 
subscribe to the beat detector during the initialisation phase. After this, the beat detector 
notifies the subscribers through a “notifyAll” method that iterates over all of the current 
subscribers, calling a notification method declared in the subscriber interface. 
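

The structure described above might be sketched as follows. The class and method names are illustrative (only “notify” and “notifyAll” are named in this report), and the sketch is single-threaded: a real implementation whose subscribers live in other threads would need to hand each notification over in a thread-safe way.

```python
from abc import ABC, abstractmethod


class BeatSubscriber(ABC):
    """Interface implemented by every component that wants beat times."""

    @abstractmethod
    def notify(self, beat_time: float) -> None:
        ...


class BeatDetector:
    """Publisher of detected beat times."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, subscriber: BeatSubscriber) -> None:
        self._subscribers.append(subscriber)

    def _notify_all(self, beat_time: float) -> None:
        # Push the beat time to every current subscriber; nobody has to poll.
        for subscriber in self._subscribers:
            subscriber.notify(beat_time)

    def on_beat_detected(self, beat_time: float) -> None:
        # Hypothetical hook called by the detection logic when a beat is found.
        self._notify_all(beat_time)


class BeatLogger(BeatSubscriber):
    """Example subscriber that simply records the beat times it receives."""

    def __init__(self):
        self.beat_times = []

    def notify(self, beat_time: float) -> None:
        self.beat_times.append(beat_time)
```

Because the detector only depends on the subscriber interface, new consumers of beat times can be added without modifying the detector itself.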


Fig. III.1-2 - UML class diagram showing the publish/subscribe structure of the detected-beat notification procedure.

[Diagram not reproduced. Recoverable operations: +notify(in beatTime : double); #notifyAll(in beatTime : double).]
III.1.2.2. The Video Processing Interfaces


The video streaming component posed an interesting design challenge, as the two types 
of video source that this project deals with differ in their frame-access abilities. Video files 
provide random access to all frames of the video, whereas video cameras can only provide 
frames in sequential order. Yet it was desirable for us to be able to encapsulate the beha- 
viour of the two types of video source behind a common interface, so that other compon- 
ents in the system would only need to deal with one interface rather than two. The solu- 
tion we chose was to declare a method in the video stream interface that returns metadata 
describing the stream’s frame searching capabilities. 
A similar problem arose when implementing the baton tracker interface, as tracking algorithms vary in their temporal capabilities. It is useful for testing purposes when a track-
ing algorithm allows us to retrack the current frame, or even a past frame, with modified 
parameters. Of course, we can always retrack the current frame or a past one if we are pre- 
pared to either (a) revert to a cached earlier state; (b) retrack the whole sequence again 
from the beginning; or (c) reinitialise the tracker at the required frame, causing it to dis- 
card all of the information it has gathered up to the current point in the video sequence. 
The third choice isn’t generally a valid option however, as we would usually be interested 
in how the tracker would have performed if it had processed the video with the new para- 
meters from the start of the sequence. 


The first tracking algorithm we implemented, based on shape recognition (see section 
III.2.1), is temporally stateless in the sense that it gives the same output for any given 
frame regardless of when it occurred in the video stream, i.e. it supports random frame access¹. The condensation algorithm on the other hand (see section III.2.2) can only retrack
the current frame or a past one by one of the three methods listed in the previous para- 
graph. None of these methods are particularly desirable though as (a) caching the con- 
densation tracker’s state in every frame has a considerable memory requirement; (b) re- 
tracking the whole sequence again from the beginning would take too long; (c) reinitialisa- 
tion would be invalid for the reason given above. To deal with this problem we again de- 
clared a method in the baton tracker interface that returns information about the tracker’s 
ability to handle frames out of sequence. 


The video streaming components and baton tracking components are controlled by a 
class within the thread labelled “video controller thread” in fig. III.1-1. We will refer to this 
class as the video controller. When the user requests a frame, the video controller first of 
all checks the frame searching capabilities of the video streaming component and the 
frame ordering requirements of the tracker component. If these are found to be compatible 
with the requested frame number, an internal request is made for the requested frame to 
be sent to the tracker on the next iteration of the video thread’s loop. This process is shown 
in fig. III.1-3. Note that “requestFrame(frameNo)” is executed in the caller’s thread space (which in this case would be the GUI thread space), whilst the section beginning “if(frameRequested())” is executed asynchronously in the video controller thread space.
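

A rough sketch of this request handling is shown below. Only “requestFrame” and “frameRequested” are named in this report; the capability flags, the compatibility rule (out-of-sequence requests need random access on both sides) and the `loop_iteration` method are assumptions made for illustration.

```python
import threading


class VideoController:
    def __init__(self, stream_random_access, tracker_random_access):
        # Capability metadata reported by the stream and tracker interfaces.
        self.stream_random_access = stream_random_access
        self.tracker_random_access = tracker_random_access
        self.current_frame = -1  # no frame has been processed yet
        self._pending = None
        self._lock = threading.Lock()

    def request_frame(self, frame_no):
        """Runs in the caller's thread (e.g. the GUI thread)."""
        with self._lock:
            in_sequence = frame_no == self.current_frame + 1
            can_seek = self.stream_random_access and self.tracker_random_access
            if not in_sequence and not can_seek:
                return False
            self._pending = frame_no
            return True

    def frame_requested(self):
        with self._lock:
            return self._pending is not None

    def loop_iteration(self, track_frame):
        """One pass of the video controller thread's loop."""
        with self._lock:
            frame_no, self._pending = self._pending, None
        if frame_no is not None:
            track_frame(frame_no)
            with self._lock:
                self.current_frame = frame_no
```

The lock only protects the hand-over of the pending frame number, so the caller never blocks while a frame is being tracked.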


1. This isn’t strictly true: its estimation of the baton’s velocity and acceleration vectors does depend on the time ordering of the frames. However, the tracker doesn’t use these vectors itself; it generates them for the benefit of the condensation algorithm-based tracker.
Fig. III.1-3 — Sequence diagram showing how the video controller handles frame requests.


III.2 Baton Tracking


Implementing two different tracking algorithms in this project allowed us to evaluate 
their performance relative to each other and assess their respective merits and shortcom- 
ings, particularly with respect to their suitability to real-time performance. The differing 
behaviour of the two algorithms made different demands on the beat detector, which will 
be discussed later. 


NOTE: Any references made to pixel luminance within this section assume a normalised 
scale, where 1 represents full intensity and 0 represents zero intensity. 


III.2.1 Shape Recognition-Based Tracking


In section II.3.2 we presented a metric M(P) for determining how similar the shape of an 
image segment P is to that of a baton based on the length of P’s central axis and the stand- 
ard deviation of its pixels about its axis. We will now discuss the process by which we ex- 
tracted the segments from each frame of the video stream. We will also explain how we 
calculated the central axis of each segment, which is prerequisite to the calculation of 
M(P). 


III.2.1.1 Background Subtraction


Fig. III.2-2 shows the main stages of our algorithm. Input images are generally very
noisy, so to facilitate the segmentation phase we filtered each image with a median blur 
and used the filtered image to estimate an image of the background. Our background es- 
timation procedure worked on the assumption that the background is static and that there 
are no significant changes in the lighting. These are fair assumptions to make considering 
that the system will typically only be used indoors. 


We used the running average method to accumulate the background over a user-specified number of frames during which the scene is empty. Given a pixel $p_t$ from the image that occurs at time t, this method approximates the average colour of the corresponding background pixel $b_t$ as $b_t = (1-\alpha) \cdot b_{t-1} + \alpha \cdot p_t$, where $\alpha \in [0, 1]$ is a user-specified weighting factor.


Once the background image B has been calculated, the algorithm generates a mask $M_B$, which is initially used to separate any foreground objects in the current image I that have entered the scene (such as the user) from the background. The mask is calculated by comparing each pixel $p_I$ of I to its corresponding pixel $p_B$ in B, and masking out the pixel if $|p_I - p_B| < \tau$, where $\tau$ is a user-specified threshold.
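

The two steps above can be sketched as follows, assuming greyscale images stored as lists of rows of normalised luminance values. The function names are illustrative, and how the first background estimate is seeded (for example with the first frame) is left to the caller.

```python
def update_background(background, frame, alpha):
    """Running average: b_t = (1 - alpha) * b_{t-1} + alpha * p_t, per pixel."""
    return [[(1.0 - alpha) * b + alpha * p for b, p in zip(b_row, p_row)]
            for b_row, p_row in zip(background, frame)]


def background_mask(background, frame, tau):
    """True marks a pixel as masked out, i.e. close enough to the background."""
    return [[abs(p - b) < tau for b, p in zip(b_row, p_row)]
            for b_row, p_row in zip(background, frame)]
```

With a threshold of 0.06, a pixel whose luminance matches the background is masked out, while one that differs by 0.4 is kept as foreground.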
The mask is used to indicate areas of the image that should be ignored when searching
for candidate segments. Every time a new segment is found, the area of the mask that the 
segment covers is masked out so that the algorithm won’t consider it again. Eventually 
every pixel of the image is masked out and the algorithm terminates, returning the seg- 
ment P for which M(P) gave the highest value or NULL if no segment was found for which 
the metric gave a value > 0. 


Fig. III.2-1 — Images from the background subtraction phase: (a) an input image of an empty scene; (b) the estim- 
ated background image averaged over 50 input images; (c) an input image featuring the user; (d) the background 
subtraction mask with the threshold set to 0.06 (black=background, white=foreground). 


Fig. III.2-1 above shows an example of the results of this background subtraction tech- 
nique. Fig. III.2-1 (b) shows how most of the noise from the initial source images is filtered 
out by averaging. In fig. III.2-1 (d) most of the background has been successfully removed
using a relatively low threshold, although there are still a few noisy pixels that have 
evaded the filtering process. Notice also that some parts of the hand and baton have been 
mistakenly classified as part of the background. A higher threshold would remove the 
stray pixels that haven’t been filtered out, although too high a threshold would cause fur- 
ther degradation to the hand and baton. 




Fig. III.2-2 — Flow chart showing the operation of the shape recognition-based tracker on a single image. Conditional transitions are indicated by square brackets. Actions to be taken when a condition holds follow a forward slash.
III.2.1.2 Extracting Segments


Our segment extraction method was based on heuristics that relate to the following two 
laws from Gestalt psychology (taken from [WP4]):


Law of Similarity — Our minds group similar elements together. The similarity is determ- 
ined by the form, colour, size and brightness of the elements. 


Law of Proximity — Elements that are close to each other spatially or chronologically are 
grouped together by our minds and seen as belonging together. 


Combining these laws gives us the heuristic that any two pixels that are of similar lu- 
minance and are in close proximity of each other are likely to be part of the same object. 
This is certainly true of the pixels that define the surface of the baton, as you can see in fig. 
III.2-1 (c). 


We define two pixels as being close to one another if they are neighbours. To determine 
whether or not a pixel p is similar to its eight neighbours N(p), our algorithm calculates the 
average absolute difference between p’s intensity and the intensities of those neighbours. 
Again we used a threshold to define how small this average difference must be for p to be 
considered similar to its neighbours. 


This forms the basis of the method we used to search for the pixels that make up each 
segment. The algorithm chooses an initial unmasked pixel p and tests whether or not it is 
similar to its neighbours N(p). If they are similar, the algorithm then tests each pixel p’ of 
N(p) to see whether or not it is similar to its neighbours N(p’), and so on until no more sim- 
ilar neighbouring pixels are found. All of the pixels that satisfy the similarity test are ad- 
ded to a set P that defines the segment. 


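Put more formally, the algorithm constructs an image segment P located about a point p with intensity f(p) as defined by the following recursive function:


$$P = h(p, \tau) = \begin{cases} \emptyset, & \text{masked}(p) \vee p \in P \\ \{p\}, & \neg\text{masked}(p) \wedge \neg s(p, \tau) \\ \{p\} \cup \bigcup_{p' \in N(p)} h(p', \tau), & \neg\text{masked}(p) \wedge s(p, \tau) \wedge p \notin P \end{cases}$$


$$s(p, \tau) \iff \frac{\sum_{p' \in N(p)} |f(p) - f(p')|}{|N(p)|} < \tau$$


Where:
$\tau$ is the similarity threshold.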


Of course, using this recursive function to compute P would be inefficient. Instead, we 
adapted the scanline-based region-filling algorithm given in [KF1]. This algorithm was ap- 
propriate to use because it fills regions by iterating across neighbouring pixels starting 
from a seed point. To adapt it to the task of segment extraction, we used our similarity test 
to indicate the boundary of the region and we added each new pixel that the region filler
found to a data structure that represents the image segment. 
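

For illustration, the recursive definition can be evaluated without recursion by using an explicit stack, as in the sketch below. Note that this is a plain pixel-by-pixel fill, which is simpler but less efficient than the scanline-based region filler of [KF1] that we actually adapted; the function names and the image representation (lists of rows of luminance values) are assumptions.

```python
def neighbours(p, width, height):
    """The (up to eight) neighbours of pixel p that lie inside the image."""
    x, y = p
    return [(x + dx, y + dy)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dx or dy) and 0 <= x + dx < width and 0 <= y + dy < height]


def similar(image, p, tau):
    """s(p, tau): mean absolute luminance difference to the neighbours < tau."""
    ns = neighbours(p, len(image[0]), len(image))
    if not ns:
        return False
    x, y = p
    total = sum(abs(image[y][x] - image[ny][nx]) for nx, ny in ns)
    return total / len(ns) < tau


def extract_segment(image, masked, seed, tau):
    """Collect the segment P grown from the seed pixel."""
    width, height = len(image[0]), len(image)
    segment = set()
    stack = [seed]
    while stack:
        p = stack.pop()
        x, y = p
        if masked[y][x] or p in segment:
            continue  # first case of h: contributes nothing
        segment.add(p)  # second and third cases both include p itself
        if similar(image, p, tau):
            stack.extend(neighbours(p, width, height))  # third case: recurse
    return segment
```

After a segment has been extracted, the caller would mask out every pixel in it so that later searches skip that region.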


III.2.1.3 An Image Segment Data Structure


The nature of the segmentation algorithm necessitates the use of a dynamic data struc- 
ture to represent the segments. The simplest data structure that we could have used to do 
this would have been a stack of 2D coordinates. This representation isn’t without problems 
though. 


Suppose each node of the stack requires 12 bytes (8 bytes for the coordinates + 4 bytes for the pointer to the next node). In a typical video sequence with a frame size of 320*240 (the size we used in our tests), the tracker would need to dynamically allocate and deallocate memory for tens of thousands of stack nodes per frame¹, in addition to the data structures used by the region filler². On our test machine we found this to cause significant performance problems, possibly due to memory fragmentation, as the order of the allocations and deallocations was highly dependent on the input image.


Fig. III.2-3 — An example image segment.


To solve this problem, we developed a more memory-efficient data structure that operates in a manner analogous to run-length-encoding-based data compression. The data structure is a list of sorted trees of pixel spans. A tree in this data structure contains all of the pixel spans that lie on a single scanline. The list contains a single tree for each scanline that the segment covers.


A span of pixels from point $(x_0, y)$ to $(x_1, y)$ is represented as $(x_0, x_1+1)$. Thus, the shaded segment shown in fig. III.2-3 would be represented as a list L of three trees as follows:


L[2] = Tree{(0, 2), (5, 7)}
L[1] = Tree{(0, 7)}
L[0] = Tree{(0, 3), (5, 6)}


Each node of the tree contains a single pixel span and points to two child nodes. The spans are sorted in each tree according to their position on the scanline. The tree nodes are ordered in the usual way - the left subtree of node t contains all of the spans that are < t’s span, and the right subtree of t contains all of the spans that are > t’s span. To maintain this ordering when adding a new span s to the tree, s is merged with any spans currently in the tree that it lies adjacent to. The most general case that arises is illustrated in fig. III.2-4.
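

The merging behaviour can be illustrated with the simplified sketch below. It keeps a sorted Python list of spans per scanline instead of the sorted tree we implemented, which changes the insertion cost but not the resulting set of spans. The class and method names are hypothetical.

```python
import bisect


class SpanSegment:
    """Image segment stored as half-open pixel spans (x0, x1 + 1) per scanline."""

    def __init__(self):
        self.rows = {}  # scanline y -> sorted list of (start, end) spans

    def add_span(self, y, x0, x1):
        """Add pixels x0..x1 (inclusive) on scanline y, merging adjacent spans."""
        spans = self.rows.setdefault(y, [])
        start, end = x0, x1 + 1
        # Index of the first stored span that begins at or after the new one.
        i = bisect.bisect_left(spans, (start, start))
        # Absorb the left neighbour if it touches or overlaps the new span.
        if i > 0 and spans[i - 1][1] >= start:
            i -= 1
            start = spans[i][0]
        # Absorb every following span that touches or overlaps the new span.
        j = i
        while j < len(spans) and spans[j][0] <= end:
            end = max(end, spans[j][1])
            j += 1
        spans[i:j] = [(start, end)]


# Rebuild scanline 1 of fig. III.2-3 from two adjacent runs of pixels.
example = SpanSegment()
example.add_span(1, 0, 3)
example.add_span(1, 4, 6)
```

After the second call, scanline 1 of `example` holds the single merged span (0, 7), matching L[1] above.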


1. In the worst case when the segments cover the entire frame, 320*240=76800 nodes would be needed. 
2. The region filler uses a stack of 20-byte elements representing spans of contiguous pixels on single scan- 
lines. New spans are pushed onto the stack every time a segment boundary on a scanline is reached. 
Fig. III.2-4 — Adding a new pixel span to a scanline tree.

[Diagram not reproduced. Panel notes follow.]

When S is merged, it will cover pixels 3-19.

(Left) The pixel span tree before S is merged and added. There are of course many possible valid sorted arrangements for the nodes of this tree.

(Right) The pixel span tree after removing spans 1 & 2, merging them with S and adding the new merged span to the tree.


Note that the theoretical possibility of a span being added to the tree that covers spans already contained within the tree never arises in practice, as the region filler never iterates over any given span more than once.


The toy example given in fig. III.2-3 wouldn’t benefit greatly from the use of this data 
structure. A more realistic example however, such as the hand in fig. III.2-5, would require 
much less memory under this representation than it would if it was represented as a stack 
of pixel coordinates. In the worst-case example of an image segment that covers an entire 
image of size w*h, this data structure would require a list of h trees, each of which contains 
a single span, so the memory requirement would be O(h). In contrast the memory requirement for a stack of pixels used to represent such a segment would be O(w*h).
Fig. III.2-5 — Image of the extracted segments after segmentation (segments are assigned distinct colours).


III.2.1.4 Calculating the Central Axis


After extracting each segment P from the image, the next stage in the algorithm is to cal- 
culate the segment’s central axis and the standard deviation of the perpendicular distances 
from each pixel to this axis. To do this we used theorem II.3.2.2-1. 


At first sight, the closed-form solution given in theorem II.3.2.2-1 seems to suggest multiple passes over the pixel coordinate data to calculate the point $\bar{p}$ on the axis and the axis’ directional vector $(\cos\theta, \sin\theta)$. In fact both of these parameters can be calculated in a single pass:


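Given N pixel coordinates $p_0, p_1, \ldots, p_{N-1}$ with mean $\bar{p} = (\bar{p}_x, \bar{p}_y)$, we can calculate the terms A and B as follows (all sums run over $i = 0, \ldots, N-1$):


$$
\begin{aligned}
A &= \tfrac{1}{2}\sum_i \left[(p_{i,y}-\bar{p}_y)^2 - (p_{i,x}-\bar{p}_x)^2\right] \\
  &= \tfrac{1}{2}\sum_i \left[p_{i,y}^2 - 2p_{i,y}\bar{p}_y + \bar{p}_y^2 - \left(p_{i,x}^2 - 2p_{i,x}\bar{p}_x + \bar{p}_x^2\right)\right] \\
  &= \tfrac{1}{2}\left[\sum_i p_{i,y}^2 - 2\bar{p}_y\sum_i p_{i,y} + N\bar{p}_y^2 - \left(\sum_i p_{i,x}^2 - 2\bar{p}_x\sum_i p_{i,x} + N\bar{p}_x^2\right)\right] \\
  &= \tfrac{1}{2}\left[\sum_i p_{i,y}^2 - 2N\bar{p}_y^2 + N\bar{p}_y^2 - \left(\sum_i p_{i,x}^2 - 2N\bar{p}_x^2 + N\bar{p}_x^2\right)\right] \\
  &= \tfrac{1}{2}\left[\sum_i p_{i,y}^2 - \sum_i p_{i,x}^2 + N\left(\bar{p}_x^2 - \bar{p}_y^2\right)\right] \\
  &= \tfrac{1}{2}\left[\sum_i p_{i,y}^2 - \sum_i p_{i,x}^2 + \tfrac{1}{N}\left(\Big(\sum_i p_{i,x}\Big)^2 - \Big(\sum_i p_{i,y}\Big)^2\right)\right]
\end{aligned}
$$


$$
\begin{aligned}
B &= \sum_i v_{i,x}\,v_{i,y} \\
  &= \sum_i (p_{i,x}-\bar{p}_x)(p_{i,y}-\bar{p}_y) \\
  &= \sum_i \left(p_{i,x}p_{i,y} - \bar{p}_y p_{i,x} - \bar{p}_x p_{i,y} + \bar{p}_x\bar{p}_y\right) \\
  &= \sum_i p_{i,x}p_{i,y} - N\bar{p}_x\bar{p}_y \\
  &= \sum_i p_{i,x}p_{i,y} - \tfrac{1}{N}\Big(\sum_i p_{i,x}\Big)\Big(\sum_i p_{i,y}\Big)
\end{aligned}
$$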


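These sums can all be calculated in a single pass, allowing us to calculate $\bar{p}$ and the set $\Omega$ defined in the theorem. We can then evaluate $R(\omega)$ for any $\omega = (\omega_x, \omega_y) \in \Omega$ as follows (with $v_i = p_i - \bar{p}$):


$$
\begin{aligned}
R(\omega) &= \sum_i \left[v_i \cdot v_i - (v_i \cdot \omega)^2\right] \\
 &= \sum_i v_i \cdot v_i - \sum_i \left(v_{i,x}\omega_x + v_{i,y}\omega_y\right)^2 \\
 &= \sum_i v_i \cdot v_i - \left[\omega_x^2 \sum_i (p_{i,x}-\bar{p}_x)^2 + 2\omega_x\omega_y \sum_i (p_{i,x}-\bar{p}_x)(p_{i,y}-\bar{p}_y) + \omega_y^2 \sum_i (p_{i,y}-\bar{p}_y)^2\right] \\
 &= \sum_i v_i \cdot v_i - \left[\omega_x^2\Big(\sum_i p_{i,x}^2 - N\bar{p}_x^2\Big) + 2\omega_x\omega_y\Big(\sum_i p_{i,x}p_{i,y} - N\bar{p}_x\bar{p}_y\Big) + \omega_y^2\Big(\sum_i p_{i,y}^2 - N\bar{p}_y^2\Big)\right]
\end{aligned}
$$


The only one of these sums whose value we don’t already know from our calculation of A and B is $\sum_i v_i \cdot v_i$. This term is constant, however, when we are searching for the argument $\omega \in \Omega$ that minimises R, and so it needn’t be calculated at all.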


Once we’ve found the parameters of the central axis, we then search for the end points 
of the segment by transforming the coordinates of its pixels as described at the end of sec- 
tion II.3.2.2. This requires a second pass over the coordinate data. 
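The single-pass calculation can be sketched in Python as follows. This is an illustration under stated assumptions rather than our implementation: the function name `central_axis` is hypothetical, and the direction is obtained from the standard closed form for a total least squares line, $\theta = \tfrac{1}{2}\operatorname{atan2}(B, -A)$, instead of searching a candidate set as in theorem II.3.2.2-1. The standard deviation is taken as $\sqrt{R/N}$.

```python
import math

def central_axis(pixels):
    """Return (mean point, unit direction, std of perpendicular distances)."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in pixels:                 # the single pass over the coordinate data
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    mean = (sx / n, sy / n)
    # Centred second moments recovered from the raw sums.
    cxx = sxx - sx * sx / n
    cyy = syy - sy * sy / n
    a = 0.5 * (cyy - cxx)               # the term A
    b = sxy - sx * sy / n               # the term B
    theta = 0.5 * math.atan2(b, -a)     # assumed closed form for the axis angle
    wx, wy = math.cos(theta), math.sin(theta)
    # R(w) = sum(v.v) - sum((v.w)^2), evaluated from the same sums.
    r = (cxx + cyy) - (wx * wx * cxx + 2 * wx * wy * b + wy * wy * cyy)
    return mean, (wx, wy), math.sqrt(max(r, 0.0) / n)

# Pixels lying exactly on the line y = 2x: the spread about the axis is zero.
mean, (wx, wy), std = central_axis([(0, 0), (1, 2), (2, 4), (3, 6)])
```

Accumulating raw sums of squares can lose precision for very large coordinates; for images of the size used here this is unlikely to matter.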


At this point we can finally calculate the value of the metric M(P). Fig. III.2-6 shows the result of the algorithm on the input image presented in fig. III.2-1 (c). The algorithm has correctly identified the black part of the baton in this case (the baton was painted black to make it stand out against the white background).


Fig. III.2-6 — Image of the tracked baton points (shown as coloured squares) and the baton’s central axis (shown as a green dotted line).
III.2.1.5 Conclusion


From what we have discussed so far, the tracker can give the end points of any segment 
P in the image that maximises the value of the metric M(P). However this doesn’t guaran- 
tee that the points it outputs are true positives. An example of a false match is shown in 
fig. III.2-7. 


Fig.III.2-7 — These images show how the tracker can generate a false positive. In image (b), the centre of the bat- 
on has been mistakenly classified as being part of the background, and so the baton is split into two unconnected 
segments. There also just happens to be a thin segment near the user’s body that’s longer than the two segments of 
the baton, and so this longer segment is incorrectly identified as the baton, as shown in image (c). 




This figure identifies the two main causes of tracker failure: 


1. The tracker may fail to locate the baton if the background subtraction stage splits 
the baton into multiple segments. 


2. The tracker will fail to locate the baton if there is another segment in the image 
that’s longer than the baton segment and thin enough to not be discarded. 


The first cause of failure in this list could have been avoided by joining up collinear segments that are within a short enough distance of one another to support the possibility that they are part of the same object. Due to time constraints, however, we were unable to do this.


The second cause of failure can be reduced by taking the expected motion of the baton 
into account. The baton generally moves short distances over consecutive frames, and its 
motion follows a cyclic path, as shown in fig. II.2-2. Thus, by incorporating information 
about the motion of the baton into a tracking algorithm, we can reduce the degree to 
which the path it outputs deviates from the true path of the baton due to background clut- 
ter. This was our motivation for using the condensation algorithm, and our approach to its 
implementation is discussed in section III.2.2. 


The tracker will also fail if the background subtraction threshold is set so low that the 
tracker fails to identify the region of the image surrounding the baton as being part of the 
background, or if the threshold is set so high that the tracker cannot distinguish the baton 
from the background. However these two causes of failure can generally be avoided by 
setting the threshold to a value somewhere in-between. 


III.2.2 Tracking with the Condensation Algorithm


In section II.3.3 we presented the theoretical basis of the condensation algorithm. We did not, however, discuss how to sample from and evaluate the motion model $P(X_t \mid X_{t-1})$, how to evaluate the likelihood distribution $P(Z_t \mid X_t)$, or how to sample from and evaluate the importance function $g(X_t)$. The nature of these distributions is problem-specific and will be discussed in the following sub-sections.


III.2.2.1 Sampling $P(X_t \mid X_{t-1})$


Recall from section II.3.3.2 that sampling from the motion model is the second step in sampling from the prior $P(X_t \mid Z_{t-1})$. We would ideally like samples drawn from this distribution to be as close as possible to the true position of the baton in frame t, given the information we have about the position and motion of the baton in frame t-1.


Let $T_t$ be the time in seconds at which frame t occurs. The duration, $\delta_{t-1}$, of frame t-1 is then $\delta_{t-1} = T_t - T_{t-1}$. If $\delta_{t-1}$ is small for any given t, we can reasonably assume that the acceleration of the baton’s end points at time $T_t$ will be similar to their acceleration at time $T_{t-1}$. This gives us the basis for the deterministic part of our motion model.


We define the state $X_t$ of the baton in frame t as follows:

$$X_t = (P_{0,t}, V_{0,t}, A_{0,t}, P_{1,t}, V_{1,t}, A_{1,t})$$

Where:

$(P_{0,t}, P_{1,t})$ = the positions of the baton’s end points at time $T_t$.


$(V_{0,t}, V_{1,t})$ = the velocity of each of the baton’s end points at time $T_t$.


$(A_{0,t}, A_{1,t})$ = the acceleration of each of the baton’s end points at time $T_t$.


The assignment of the indices 0 and 1 to each of the baton’s end points is arbitrary, although it should of course remain consistent from frame to frame.


Given $X_{t-1}$ and $P_{i,t}$ for each $i \in \{0,1\}$, we can estimate the other derivatives of its motion as:


$$V_{i,t} \approx \frac{P_{i,t} - P_{i,t-1}}{\delta_{t-1}}, \qquad A_{i,t} \approx \frac{V_{i,t} - V_{i,t-1}}{\delta_{t-1}}$$


These estimates tend to the true values of the derivatives as $\delta_{t-1} \to 0$, and so given a high enough frame rate we could potentially estimate even more derivatives in the same manner. In practice, though, the maximum achievable frame rate is limited by the camera and the efficiency of the algorithm. Besides, having an estimate for the acceleration vectors is good enough to allow our algorithm to predict curved trajectories.
Before calculating these derivatives, a value for $P_{i,t}$ needs to be obtained. We used our assumption that $A_{i,t} \approx A_{i,t-1}$ for all i, t to predict $P_{i,t}$ using the formula $s = ut + \tfrac{1}{2}at^2$, which gives the distance s travelled by an object in terms of its initial velocity, u, its acceleration, a, and the duration of its motion, t. Our predictive formula is:


$$P_{i,t} = P_{i,t-1} + V_{i,t-1}\,\delta_{t-1} + \tfrac{1}{2}A_{i,t-1}\,\delta_{t-1}^2 + \Omega\,\delta_{t-1} \qquad \text{(Eq. III.2.2.1)}$$


Where:


$\Omega$ is a 2D random variable that follows a bivariate Gaussian distribution.


We include $\Omega$ to allow for sudden changes in a baton point’s velocity vector. So to sample from $P(X_t \mid X_{t-1})$, we calculate the deterministic part of eq. III.2.2.1 and add to it a sample drawn from the distribution of $\Omega$ scaled by $\delta_{t-1}$. We assume that the mean of $\Omega$ is zero, as the deterministic part of the equation is the best guess we can make about the baton’s position at time t from the information we have.


Thus we used a 4D Gaussian distribution to model unpredictable changes in the x and y 
ordinates of the velocity of the baton’s base and tip. The covariance matrix of this distribu- 
tion represents information about the relationship between these velocity variables. We 
would expect there to be high covariance between the y ordinates of the baton’s base and 
tip, and so a good estimate of the covariance matrix at each time t should lead to more ac- 
curate predictions. Due to a lack of time however we were unable to do this, so we as- 
sumed zero covariance between these variables, which represents the tracker having no in- 
formation about the relationship between these variables. This still produced acceptable 
results given a large enough sample set size, as shall be discussed in section IV.2. 
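As an illustration only, the prediction step for a single end point could be sketched in Python as follows. The position is advanced using the constant-acceleration formula, zero-mean Gaussian noise scaled by the frame duration is added, and the velocity and acceleration are then re-estimated by finite differences. The names `predict_point`, `update_derivatives` and `sigma` are hypothetical; the noise is drawn independently per ordinate, which corresponds to the zero-covariance assumption described above.

```python
import random

def predict_point(p, v, a, dt, sigma, rng):
    """Sample a new end-point position: deterministic part plus scaled noise."""
    return tuple(
        p[k] + v[k] * dt + 0.5 * a[k] * dt * dt + rng.gauss(0.0, sigma) * dt
        for k in range(2)
    )

def update_derivatives(p_new, p_old, v_old, dt):
    """Finite-difference estimates of the new velocity and acceleration."""
    v_new = tuple((p_new[k] - p_old[k]) / dt for k in range(2))
    a_new = tuple((v_new[k] - v_old[k]) / dt for k in range(2))
    return v_new, a_new

rng = random.Random(1)                       # seeded for reproducibility
p_old, v_old, a_old = (10.0, 20.0), (30.0, -60.0), (0.0, 120.0)
# With sigma = 0 only the deterministic part of the model remains.
p_new = predict_point(p_old, v_old, a_old, 0.5, 0.0, rng)
v_new, a_new = update_derivatives(p_new, p_old, v_old, 0.5)
```

In a tracker, `sigma` would be non-zero and one such prediction would be drawn per end point for every sample in the set.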


III.2.2.2 Evaluating $P(Z_t \mid X_t)$


When the baton can be distinguished from its background (this is an implicit assumption for both of our trackers), an edge-detecting convolution filter generally gives high edge strength values at the baton’s boundary. This observation makes the set of extracted edges a suitable candidate for our $Z_t$ data.


The value of $P(Z_t \mid X_t)$ can be interpreted as a measure of the likelihood of frame t having the edges that it has given that the baton is in state $X_t$. Notice the use of the word “likelihood”. $P(Z_t \mid X_t)$ is only ever evaluated as part of eq. II.3.3.2, where it is normalised over the samples of $S_t$. Hence we needn’t evaluate $P(Z_t \mid X_t)$ as a normalised probability here; we can calculate it as an unnormalised likelihood value. To avoid confusion, we shall now refer to this likelihood as $L(Z_t \mid X_t)$, or $L_t$ for short.


The effect of $L_t$ should be to increase the chance of strong hypotheses in the sample set $S_t$ being used as the basis for predicting the samples of $S_{t+1}$ and to reduce the chance of weak hypotheses being used for prediction. A perfect hypothesis, $c_t = (P_{0,t}, V_{0,t}, A_{0,t}, P_{1,t}, V_{1,t}, A_{1,t})$, would be one that gives the exact position of the baton’s end points in frame t. In terms of $Z_t$’s data, we would expect to see two strong edges (the baton’s boundary) on either side of and parallel to the line $P_{1,t} - P_{0,t}$. $L_t$ should be at its peak value in this case to ensure that $c_t$ is more likely to be sampled from $S_t$ in the next iteration of the algorithm than any other weaker hypothesis.


Of course there’s no guarantee of a perfect hypothesis being generated, so in general we want $L_t$ to act as a hypothesis metric whose value is high when the two hypothesised end points are near a pair of near-parallel edges. Thus we need to search for the baton’s edges in the area around each hypothesis.


Fig. III.2-8 — The general case in which we need to evaluate $L_t$.


Searching the entire image for each of the N samples would be highly inefficient. It would be pointless in fact, because the value of $L_t$ should be low when the hypothesised points are far from the true position of the baton, not to mention the fact that a search of the whole image would almost certainly cause our evaluation of $L_t$ to be thrown off by clutter.


The general case is illustrated in fig. III.2-8. The labels indicate the following:


• $B_0$ and $B_1$ are the true baton end points. The dark tube around them represents the baton’s edges. This tube may have gaps in it due to noise in the source image.


• $P_0$ and $P_1$ are the hypothesised baton end points.


• Quadrilateral $R_0R_1R_2R_3$ is the search window. $|L_0R_0| = |L_0R_1| = |P_0L_0|$ and $|L_1R_3| = |L_1R_2| = |P_1L_1|$. $|P_0L_0|$ and $|P_1L_1|$ can be set according to our belief in the accuracy with which $P_0$ and $P_1$ were predicted.


• Line $L_0L_1$, which passes through $P_0$ and $P_1$, is the central axis of $R_0R_1R_2R_3$, from which the search should be directed.


• The arrows are the actual lines about $L_0L_1$ that we will search along. Again, for the sake of efficiency, it would be better to sample the search window rather than consider every pixel of it.


• The other solid lines are examples of background clutter that could throw off the search.
We will now define how we used the information about the edges that the search lines (i.e. the arrows in the above figure) intersect to determine a suitable value for $L_t$.


Let $\Lambda$ be the set of search lines on one side of $L_0L_1$, and let $\Lambda'$ be the set of search lines on the other side ($\Lambda'' = \Lambda$ and $|\Lambda| = |\Lambda'|$). Furthermore, let $\lambda \in \Lambda$ and $\lambda' \in \Lambda'$ be adjacent search lines on opposite sides of $L_0L_1$ (again, $\lambda'' = \lambda$).


Define: $\Lambda_{[i]}$ as the i-th search line of $\Lambda$, enumerated according to the order in which they occur when moving from $L_0$ to $L_1$; $N(\lambda)$ as the number of edges $\lambda$ intersects; and $\lambda_{[i]}$ as the distance between $L_0L_1$ and the i-th edge that $\lambda$ intersects, such that


$$\forall i, j \;\left(0 \le i < j < N(\lambda) \rightarrow \lambda_{[i]} < \lambda_{[j]}\right)$$


We can then define the following statistics based on these sets for any $\Lambda \in \{\Lambda, \Lambda'\}$:


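$$\mu_+(\Lambda) = \begin{cases} \tfrac{1}{2}\,E\!\left[\,\left|\lambda_{[0]} - \lambda'_{[0]}\right|\,\right] \text{ for } \lambda \in \Lambda_+, & |\Lambda_+| > 0 \\ k_1, & \text{otherwise} \end{cases}$$


$$\sigma^2_+(\Lambda) = \begin{cases} E\!\left[(\lambda_{[0]} + \lambda'_{[0]})^2\right] - E\!\left[\lambda_{[0]} + \lambda'_{[0]}\right]^2 \text{ for } \lambda \in \Lambda_+, & |\Lambda_+| > 1 \\ k_2, & \text{otherwise} \end{cases}$$


Where:


$$\Lambda_+ = \{\lambda \,;\; \lambda \in \Lambda,\; N(\lambda) > 0,\; N(\lambda') > 0\}$$


$$\mu_-(\Lambda) = \begin{cases} \tfrac{1}{2}\,E\!\left[\lambda_{[1]} - \lambda_{[0]}\right] \text{ for } \lambda \in \Lambda_-, & |\Lambda_-| > 0 \\ k_3, & \text{otherwise} \end{cases}$$


$$\sigma^2_-(\Lambda) = \begin{cases} E\!\left[(\lambda_{[1]} - \lambda_{[0]})^2\right] - 4\mu_-^2(\Lambda), & |\Lambda_-| > 1 \\ k_4, & \text{otherwise} \end{cases}$$


Where:


$$\Lambda_- = \{\lambda \,;\; \lambda \in \Lambda,\; N(\lambda) > 1\}$$


$$\sigma^2_3(\Lambda) = \begin{cases} E\!\left[\left((\Lambda_{[j]})_{[0]} - (\Lambda_{[k]})_{[0]}\right)^2\right] - E\!\left[(\Lambda_{[j]})_{[0]} - (\Lambda_{[k]})_{[0]}\right]^2 \text{ for } (j,k) \in \Delta, & |\Delta| > 1 \\ k_5, & \text{otherwise} \end{cases}$$


Where:

$$\Delta = \{(j,k) \,;\; 0 \le j < k < |\Lambda|,\; N(\Lambda_{[j]}) > 0,\; N(\Lambda_{[k]}) > 0,\; \neg\exists l\,(j < l < k,\; N(\Lambda_{[l]}) > 0)\}$$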
$\mu_-(\Lambda)$ and $\mu_+(\Lambda)$ are the mean distances between $L_0L_1$ and the midpoints of the pairs of intersections that are closest to it. $\mu_-(\Lambda)$ is for the case when both of the baton’s edges lie on one side of $L_0L_1$, and $\mu_+(\Lambda)$ is for the case when the edges lie on either side of it. They measure how close the central axis of the closest pair of edges is to $L_0L_1$, evaluating to 0 when the central axis lies on $L_0L_1$.


$\sigma^2_-(\Lambda)$ and $\sigma^2_+(\Lambda)$ give the variances of the distances between the pairs of intersections that are closest to $L_0L_1$. As before, $\sigma^2_-(\Lambda)$ is for the case when the baton’s edges lie on one side of $L_0L_1$, and $\sigma^2_+(\Lambda)$ is for the case when the edges lie on either side of it. They measure how parallel the closest edges to $L_0L_1$ are, evaluating to 0 when they’re exactly parallel.


$\sigma^2_3(\Lambda)$ is the variance of the differences between the lengths of the closest intersections of adjacent search lines to $L_0L_1$. It measures how straight the closest edge to $L_0L_1$ is, evaluating to 0 when it’s perfectly straight.


$k_1, k_2, \ldots, k_5$ are default constants for when these statistics are undefined.


When all of these statistics are low, they define properties that we would expect of $Z_t$ given a hypothesis $X_t$ that is close to the true state of the baton, i.e. they represent the fact that there should be a pair of straight, parallel edges near $X_t$. The greater any of these statistics gets, the less likely it is that the edges near $X_t$ are those of the baton. Given this, we combined them to define $L_t$ as follows:


A A! A 
L,=1+ ee ee vel 
I+p_(A) 14+ p0(A') 14+y5(A) 
A A’ A 
xT .-! i ue (Eq. Ill.2.2.2) 
I+o07(A) 1+02(A') 1+0%(A) 
A A’ 
ligt, 
I+o,(A) 1+0;,(A‘) 


Note that $\mu_+(\Lambda) = \mu_+(\Lambda')$ and $\sigma^2_+(\Lambda) = \sigma^2_+(\Lambda')$, and so only one of each of these terms is included. By scaling each term by the cardinality of the relevant subset of $\Lambda$ or $\Lambda'$, we give more weight to terms that have more data to support them.


III.2.2.3 The Importance Function and (Re)initialisation


We have now discussed all of the main aspects of our implementation of the condensation algorithm except for (re)initialisation. As noted in section II.3.3.3, we can conveniently take the importance function $g(X_t)$ as the prior distribution at time t and use it to initialise the system.


We used our shape recognition-based tracker to define $g(X_t)$. Recall that the shape recognition algorithm discards segments with a non-zero likelihood in favour of the segment that has the highest metric value. Rather than discarding these segments, we took them as the range of $g(X_t)$, and took their normalised metric values as their probabilities. The motivation behind doing this is that by including the smaller segments in the distribution we can include any incomplete segments of the baton (such as the two unconnected segments shown in fig. III.2-7), making it possible for the algorithm to draw some samples that are close to the baton’s true position. When these incomplete segments are very small, however, their normalised probabilities become very small, and so they do not have much influence on the tracker. As mentioned before, the best way to fix this problem would have been to join up all collinear segments that are close to each other.


Having defined the importance function, we also need to define the criteria for automatic reinitialisation. To do this, we considered the hypotheses of sample set $S_t$ as if they were members of a committee.


When there is a consensus amongst the members of a committee, it is generally as- 
sumed that the thing they have agreed upon is the best course of action to take (although it 
is quite possible that they are all wrong!). Disagreement occurs when various factors influ- 
ence the committee members in different directions. To express this in terms of our sample 
set, the degree to which the samples disagree can be measured by how spread out they 
are. A “consensus” occurs when they all coincide with one another. 


We used the following formula to measure the degree of disagreement $D(S_t)$ of the samples:


$$D(S_t) = \begin{cases} \sigma / \sigma_M, & \sigma \le \sigma_M \\ 1, & \text{otherwise} \end{cases}$$

$$\sigma = \sqrt{\sigma_{0,x}^2 + \sigma_{0,y}^2 + \sigma_{1,x}^2 + \sigma_{1,y}^2}$$

Where:


$\sigma_{0,x}^2, \sigma_{0,y}^2, \sigma_{1,x}^2, \sigma_{1,y}^2$ are the variances of the x and y ordinates of the tips and bases of the samples.


$\sigma_M$ is a user-defined threshold.


$\sigma$ can be thought of as a generalisation of the definition of standard deviation to four dimensions. It equates to zero when the hypotheses all coincide exactly, otherwise it gives some greater positive value. $\sigma_M$ defines a limiting acceptable value for $\sigma$, which allows us to take $D(S_t)$ as the probability of reinitialising any given sample. When $\sigma$ exceeds this limit, $D(S_t) = 1$, and so all of the samples are reinitialised.
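A minimal Python sketch of this measure is given below. It assumes that each sample is reduced to its two hypothesised end points and that the population variance is used for each ordinate; the function name `disagreement` and the example values are hypothetical.

```python
import math
from statistics import pvariance

def disagreement(samples, sigma_max):
    """samples: list of ((x0, y0), (x1, y1)) hypothesised end-point pairs.
    Returns sigma / sigma_max, capped at 1 when sigma exceeds sigma_max."""
    coords = [(p0[0], p0[1], p1[0], p1[1]) for p0, p1 in samples]
    # Sum the variances of the four ordinates, then take the square root.
    sigma = math.sqrt(sum(pvariance(column) for column in zip(*coords)))
    return sigma / sigma_max if sigma <= sigma_max else 1.0

agree = [((5.0, 5.0), (9.0, 1.0))] * 4                  # all samples coincide
spread = [((0.0, 0.0), (0.0, 0.0)), ((6.0, 0.0), (0.0, 8.0))]
d_agree = disagreement(agree, 10.0)
d_spread = disagreement(spread, 10.0)    # sigma = sqrt(9 + 16) = 5
d_capped = disagreement(spread, 4.0)     # sigma exceeds the threshold
```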


In addition to this, we used a rejection threshold R, which defines the minimum acceptable probability for any sample. Any sample whose probability is smaller than this threshold is reinitialised unconditionally. This is to eliminate the possibility of highly unlikely hypotheses remaining in the sample set for too long. Finally, we defined another parameter I that gives the probability of generating a new sample from a sample s by importance sampling rather than by conditional density propagation, given that s is not going to be reinitialised.


Fig. III.2-9 shows our scheme for generating new samples. The figure doesn’t show how the scheme fails (and outputs NULL) if $g(X_t)$ is empty when a rejected sample needs to be reinitialised from it.
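The decision logic described in the text can be sketched as follows. This is a reading of the prose only, not a transcription of the flowchart: the order of the tests, the function name `generation_mode`, and the fallback to propagation for non-rejected samples when the importance function is empty are all assumptions made for illustration. The uniform random draws are passed in as arguments so that the behaviour is deterministic.

```python
def generation_mode(weight, reject_below, disagreement, importance_prob,
                    g_is_empty, u1, u2):
    """Decide how to generate the successor of a sample whose normalised
    probability is `weight`. u1 and u2 are independent uniform draws in [0, 1).
    Returns "reinitialise", "importance", "propagate", or None on failure."""
    if weight < reject_below:                  # below R: reinitialise unconditionally
        return None if g_is_empty else "reinitialise"
    if u1 < disagreement and not g_is_empty:   # reinitialise with probability D(S_t)
        return "reinitialise"
    if u2 < importance_prob and not g_is_empty:
        return "importance"                    # importance sampling, probability I
    return "propagate"                         # conditional density propagation

rejected = generation_mode(0.0005, 0.001, 0.2, 0.5, False, 0.9, 0.9)
failed = generation_mode(0.0005, 0.001, 0.2, 0.5, True, 0.9, 0.9)
reinit = generation_mode(0.01, 0.001, 0.2, 0.5, False, 0.1, 0.9)
importance = generation_mode(0.01, 0.001, 0.2, 0.5, False, 0.9, 0.1)
propagated = generation_mode(0.01, 0.001, 0.2, 0.5, False, 0.9, 0.9)
```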


Fig. III.2-9 — Flowchart showing our sample generation scheme for the condensation algorithm. 
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III.3 Gesture Interpretation 


Whilst it should be clear by now that tracking was the main focus of this project, track- 
ing alone wasn’t enough for us to realise our goal. The tracked trajectories of the baton’s 
endpoints need to be subjected to further analysis so that the conductor’s tempo can be de- 
termined, and so that the time of the next half beat can be predicted. 


III.3.1 Beat Detection 


The trajectories output by the tracker are temporal discretisations of the true periodic motion of the baton. Fig. III.3-1 is an idealised graph of this motion. The period of the motion is p, and there are b=3 beats per bar. The dotted lines, which occur with a frequency of 2b/p Hz, indicate when the half beats occur. From the Nyquist sampling theorem, we know that to be able to sample a periodic function, our sampling rate must be at least twice the frequency of that function.


So suppose the conductor is conducting at B beats/min = B/30 half beats/sec. The lower bound on the sampling rate we must achieve in order to detect the half beats is B/15 Hz. From our discussions with conductors, we found that the maximum tempo that a conductor can comfortably conduct at is approx. 180 beats/min. So by the Nyquist sampling theorem our system wouldn’t need to run any faster than 180/15 = 12 frames/sec.


Fig. III.3-1 —- Graph showing the change in the baton’s height with respect to time when beating 3 beats per bar. 
This would be fine if the tracker’s output really did look like fig. III.3-1. All we’d need to do to detect the half beats in this case is to check when the vertical component of the baton’s velocity changes direction. In reality, however, the tracker’s output tends to be very noisy, as fig. III.3-2 shows.


Fig. III.3-2 — Graph showing an example of the condensation algorithm’s output. The blue line at the bottom is the 
vertical trajectory of the baton’s base, and the red line at the top is the vertical trajectory of its tip. 
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The naive method of detecting beats by looking for changes in the vertical direction of 
the baton would have produced lots of false positives in this case, due to the local peaks 
caused by noise. The data clearly needs to be filtered. 


Our approach to filtering the data is based on the observation that even when noise pro- 
duces false peaks in the vertical trajectories, the general shape of the true peaks is still pre- 
served (i.e. the signal to noise ratio is high). This allows us to define properties that a true 
peak in the data at time t should satisfy that a false peak due to noise generally shouldn’t. 


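We defined the following function:


$$Q_L(t) = \frac{1}{|\Phi_{t,L}|} \sum_{s \in \Phi_{t,L}} \left[\, y_{0,s} + y_{1,s} - (y_{0,t} + y_{1,t}) \,\right] \qquad \text{(Eq. III.3.1.1)}$$


Where:


$y_{0,t}, y_{1,t}$ are the y-ordinates of the baton’s tip and base at time t.


$\Phi_{t,L} \subseteq \{s \,;\; t < s \le t+L\}$, $L \in \mathbb{N}$, $L > 0$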


This function gives the average vertical displacement of the baton’s tip and base between time t and time t+L for some arbitrary L. $\Phi_{t,L}$ defines the times at which the sampled baton positions given by the tracker were non-NULL. By choosing a suitable value for L, the polarity of this function gives us the direction of the baton’s vertical motion whilst filtering out the false peaks that occur due to noise, and so we can use changes in its polarity as an indication of the beat times. Notice that for the special case of L = 1, detecting beats with this function is very similar to beat detection by the naive method mentioned above, except that this function takes into account information from both the base and the tip of the baton. The drawback of this function is that increasing L increases the delay between the time at which a beat occurs and the time at which it is detected. Setting L to 5 seems to work quite well at a frame rate of 30 frames/sec. The delay is 5/30 = 1/6 seconds in this case, which is acceptable.
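The filter can be sketched in Python as follows. This is a minimal illustration that assumes the tracker output is a list with one entry per frame, each entry being either `None` (no baton found) or a pair of y-ordinates for the tip and base; the names `q` and `detect_half_beats` are hypothetical.

```python
def q(track, t, L):
    """Average vertical displacement of tip and base over frames t+1..t+L,
    ignoring frames with no tracker output. Returns None if undefined."""
    if track[t] is None:
        return None
    ref = track[t][0] + track[t][1]
    last = min(t + L, len(track) - 1)
    later = [track[s] for s in range(t + 1, last + 1) if track[s] is not None]
    if not later:
        return None
    return sum(y0 + y1 - ref for y0, y1 in later) / len(later)

def detect_half_beats(track, L):
    """Return the frame indices at which the polarity of q changes."""
    beats, prev = [], 0
    for t in range(len(track)):
        value = q(track, t, L)
        if value is None or value == 0:
            continue                      # no information about the direction
        sign = 1 if value > 0 else -1
        if prev and sign != prev:
            beats.append(t)
        prev = sign
    return beats

# Idealised triangular motion: rises to frame 5, falls to frame 10, rises again.
heights = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5]
track = [(y, y) for y in heights]
beats = detect_half_beats(track, 2)
```

Because the value at frame t depends on frames up to t+L, a polarity change at frame t can only be reported once frame t+L has arrived, which is the detection delay discussed above.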


III.3.2 Beat Prediction 


A beat detector cannot determine when a half beat will occur until after it has occurred. 
By this point a musical response from the system would be too late. However by anticipat- 
ing when the next beat will occur, the system can prepare its response in advance, allow- 
ing it to potentially respond in synchronisation with the conductor. 


In order to predict when the next beat is going to occur, we need to model the temporal 
changes of the music. One simple way to do this is to make a similar assumption to that 
made for the motion model described in section III.2.2.1, i.e. we assume that the rate at 
which the half beat rate is changing stays approximately constant across consecutive half 
beats. 


Define $t_i$ as the time of the i-th half beat, and let $t_{m-1}$ be the observed time, in seconds, of the last half beat. The half beat rate, $b_{i-1}$ half beats/sec, of half beat i-1 can be approximated by $b_{i-1} \approx 1/(t_i - t_{i-1})$. Our assumption can be expressed in these terms as:


$$\frac{b_{m-1} - b_{m-2}}{t_{m-1} - t_{m-2}} = (b_{m-1} - b_{m-2})\,b_{m-2} \;\approx\; \frac{b_{m-2} - b_{m-3}}{t_{m-2} - t_{m-3}} = (b_{m-2} - b_{m-3})\,b_{m-3}$$


This implies that the duration $d_{m-1}$ of half beat m-1 can be estimated as:


$$d_{m-1} = \frac{1}{b_{m-1}} \approx \frac{b_{m-2}}{(b_{m-2} - b_{m-3})\,b_{m-3} + b_{m-2}^2} \qquad \text{(Eq. III.3.2.1)}$$


Using this, we can predict that half beat m will occur at time:


$$t_m \approx t_{m-1} + d_{m-1}$$


As with the motion model, this allows gradual changes of tempo to be predicted, but it is unable to predict sudden changes of tempo. Notice however that $d_{m-1}$ is negative if:


$$(b_{m-2} - b_{m-3})\,b_{m-3} + b_{m-2}^2 < 0 \;\Rightarrow\; b_{m-3}\,b_{m-2} < b_{m-3}^2 - b_{m-2}^2 = (b_{m-3} + b_{m-2})(b_{m-3} - b_{m-2})$$


As $b_{m-3}\,b_{m-2}$ is positive, this implies that $d_{m-1}$ may be negative (depending on the values of $b_{m-3}$ and $b_{m-2}$) when $b_{m-3} > b_{m-2}$, i.e. when the tempo is decreasing.


An alternative method that doesn’t give negative values for $d_{m-1}$ when the tempo is decreasing is to assume that the rate of change of the duration of each half beat stays constant across consecutive half beats, i.e.:


$$d_{m-1} \approx 2\,d_{m-2} - d_{m-3} \qquad \text{(Eq. III.3.2.2)}$$


This, however, has the opposite problem of producing negative values for $d_{m-1}$ when the tempo is increasing. Our system allows for two alternative solutions to these problems. One is to just use one of the above equations, and assume a constant tempo across consecutive half beats whenever a negative value for $d_{m-1}$ is predicted. The other is to use eq. III.3.2.1 when the tempo is increasing and eq. III.3.2.2 when the tempo is decreasing.
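The two predictors and the second of the two solutions can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the three most recent half-beat times are available; it is intended only to make the arithmetic concrete.

```python
def predict_by_rate(b2, b3):
    """Eq. III.3.2.1, with b2 = b_{m-2} and b3 = b_{m-3} in half beats/sec."""
    return b2 / ((b2 - b3) * b3 + b2 * b2)

def predict_by_duration(d2, d3):
    """Eq. III.3.2.2, with d2 = d_{m-2} and d3 = d_{m-3} in seconds."""
    return 2.0 * d2 - d3

def predict_next_half_beat(t3, t2, t1):
    """Predict t_m from the last three half-beat times t_{m-3}, t_{m-2}, t_{m-1}."""
    d3, d2 = t2 - t3, t1 - t2
    if d2 < d3:                                   # durations shrinking: tempo increasing
        d1 = predict_by_rate(1.0 / d2, 1.0 / d3)
    else:                                         # tempo steady or decreasing
        d1 = predict_by_duration(d2, d3)
    return t1 + d1

steady = predict_next_half_beat(0.0, 0.5, 1.0)        # constant tempo
speeding_up = predict_next_half_beat(0.0, 0.5, 0.75)  # b3 = 2, b2 = 4
bad_rate = predict_by_rate(1.0, 4.0)                  # sharply decreasing tempo
bad_duration = predict_by_duration(0.2, 0.5)          # sharply increasing tempo
```

With this selection rule neither equation is used in the regime where it can go negative: the denominator of eq. III.3.2.1 is positive whenever $b_{m-2} > b_{m-3}$, and eq. III.3.2.2 gives at least $d_{m-2}$ whenever $d_{m-2} \ge d_{m-3}$.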


Part IV: EVALUATION 


The following sections discuss tests we performed on our system together with the res- 
ults we obtained. In all of our tests we used videos captured at 30 frames/sec with a Philips 
TouCam Pro II webcam at a resolution of 320x240. Unlike certain other cameras, this cam- 
era had the advantage of permitting us to disable its built-in automatic compensation for 
changes in lighting, meaning we didn’t have to take changes in the camera’s configuration 
into account when evaluating the performance of our system. 


To make our tests reproducible, we performed our tests on video files captured from 
this device rather than performing the tests in real time directly from the camera. The dis- 
advantage with this is that it didn’t allow us to explore the effect of the noticeable delay 
between the time of the user’s movement and the time at which the camera captured that 
movement. This could be investigated as part of a future project. 


IV.1 Shape Recognition-Based Tracker Evaluation 


As we didn’t have any way of measuring the ground truth accurately, we were unable 
to give a quantitative evaluation of the accuracy of the shape recognition-based tracker. In- 
stead we will make the following comments about it: 


• By inspection, the tracker generally seems to be able to track the coordinates of the baton’s end points to within approx. 5 pixels of their observed locations.


• The accuracy to which it tracks the baton is largely dependent on the following two factors:


1. The quality of the estimated background image. Increasing the number of 
frames over which the background image is accumulated tends to reduce the 
amount of noise that escapes the background subtraction process. 


2. The contrast between the baton and the background. Regardless of how 
much noise the background subtraction process filters out, the tracker can 
never track the baton accurately when the whole baton or part of the baton is 
indistinguishable from the background. In the former case the tracker will 
fail completely. In the latter case it may track the largest part of the baton that 
is visible. 
The contrast between the baton and the background generally decreases as the baton moves further away from the camera. This seems to be due to the fact that the baton has such a thin projection onto the camera that beyond a certain distance its projection onto the camera is “overwhelmed” by ambient light. The only way around this with the equipment we had was to keep the camera close to the user. This had the unfortunate effect of constraining the user’s movement to the small area that is visible to the camera. A better solution may have been to film the user at a higher resolution with a camera that has better image quality; however, filming at too high a resolution would have had an adverse effect on the system’s performance.


• The tracked points sometimes fluctuate due to changes in the illumination of the baton as it moves. We painted the top 25.5 cm of the baton black, leaving 7 cm at the bottom white, to improve the baton’s contrast against the white background we used to test the system (as shown in fig. III.2-1). Despite this, the tracker is sometimes unable to distinguish between the white part of the baton and the black part, which causes its estimate of the position of the baton’s base to jump, as shown in fig. IV.1-1. This problem can generally be solved for a single image by setting the segmentation threshold to a lower value; however, it is difficult to find a value that works well across all frames.


Fig. IV.1-1 — A sequence of frames showing a fluctuation in the tracked position of the baton’s base. 


(Frame 195) 


(Frame 196) (Frame 197) 


IV.2 Condensation Tracker Evaluation 


We tested the C.A.-based (Condensation Algorithm) tracker’s accuracy by comparing its 
performance on a particular video to a benchmark set by the S.R.-based (Shape Recogni- 
tion) tracker for that video. Testing the accuracy of the tracker in this way is only useful 
when we know that the S.R.-based tracker’s output is accurate to within a given error 
bound. As we had no way of measuring this error bound, we had to judge the S.R.-based 
tracker’s accuracy by inspection. The following test is based on the video 
Indeo_test01.avi, which will be made available on our project web page in due 
course. 


We ran the S.R.-based tracker on the test video with the following parameters: 


Min. standard deviation of pixels about axis: 0.4 
Max. standard deviation of pixels about axis: 1.8 
Min. segment length: 25.0 
Max. segment length: 150.0 
No. frames for background accumulation: 50 
Background subtraction threshold: 0.007 
Segmentation threshold: 0.269 


By inspection, the S.R.-based tracker seemed to track the y-ordinates of the baton’s end points very accurately in the test video. This resulted in the trajectories shown in fig. IV.2-1. They are smooth for the most part, except for a few points around the 3 second mark and the 6.5 second mark. To fully appreciate this, however, the reader would need to observe the tracker for himself.


The S.R.-based tracker’s tracked paths for the x-ordinates of the baton’s end points were 
not as accurate however (see fig. ). The trajectory of the base is particularly noisy due to 
the fluctuations we discussed in the previous section. In view of this discrepancy, we only 
compared the tracked y-ordinates of the C.A.-based tracker to those of the S.R.-based 
tracker. 


We defined the following error function to determine how close the C.A.-based tracker’s 
output baton state in any given frame was to that of the S.R.-based tracker in the same 
frame: 


$$\mathrm{error}(B_S, T_S, B_C, T_C) = \sqrt{\|B_S - B_C\|^2 + \|T_S - T_C\|^2}$$


Where:
$B_S, T_S$ are the base and tip positions respectively tracked by the S.R.-based tracker.
$B_C, T_C$ are the base and tip positions respectively tracked by the C.A.-based tracker.
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For our test we wanted to investigate the effect that changing the size of the sample set 
would have on the condensation algorithm’s accuracy. We set the parameters of the S.R.- 
based tracker that it uses to calculate the importance function to the same values as those 
given above. We set the remaining parameters to the following values: 


Edge detection threshold: 0.047
Hypothesis rejection threshold, R: 0.001
No. search lines on each side of the hypotheses: 10
Probability of using importance sampling when not reinitialising, I: 0.5
Search window size at base and tip: 10
$\sigma_M$: 10.0
Seed for the uniform distribution sampler used to draw samples from $S_{t-1}$: 1
Seed for the Gaussian distribution sampler used by the motion model to draw samples for the baton’s base: 1304
Seed for the Gaussian distribution sampler used by the motion model to draw samples for the baton’s tip: 65403


We performed the test by varying the sample set size from 5 samples to 500 samples, keeping the other parameters the same. For each sample set size, we tracked the baton over the 1202-frame test video and calculated the error in each frame. Fig. IV.2-2 summarises the average error for each sample set size. The full results will be made available on our project web page.


Our results show a very high average error of 22.28236 with a sample set of size 5; however, the error drops rapidly as we increase the sample set size from 5 to 40. Beyond 40 samples, the average error appears to converge to about 2.3. The significance of its convergence to this particular value is questionable: we do not know the error bound of the S.R.-based tracker, so this limiting value may be influenced by the fact that our error function is biased towards the S.R.-based tracker.


IV.2 Condensation Tracker Evaluation 




Fig. IV.2-1 — Graph (a) shows the y-ordinates of the tip and base of the baton as tracked by the S.R.-based tracker 
during the first 12.4 seconds of the test video. The red line at the top is the trajectory of the baton’s tip, and the blue 
line at the bottom is the trajectory of its base. Graph (b) shows the x-ordinates from the same data set. 


Fig. IV.2-2 — A summary of the results from our test of the relationship between the condensation algorithm sample set size and the average tracker error. 


Graph showing decrease in average error w.r.t. increase in sample set size 


Part V: CONCLUSION


In conclusion, we found the condensation algorithm to be effective in tracking the baton to an acceptable average degree of error. However, our tests showed that the algorithm very frequently reinitialises the entire sample set in one go from the importance function. With a set of 5 samples, the algorithm would sometimes reinitialise all of them up to 8 times per second. With a larger set of 500 samples, the algorithm rarely seemed to run for much longer than 1 second before reinitialising all of the samples.


This is almost certainly due to the inadequacy of our motion model. As noted earlier, the motion model should have taken the covariance between the velocity variables into account. A good motion model should also take into account the expected time of the next beat, as this gives an indication of when the baton is going to change direction. The motion model could have used this information together with a prior model of the motion of the conductor’s baton for different time signatures (as shown in fig. II.2-2) to predict the baton’s motion more accurately.


So although our implementation of the condensation algorithm can give good results, its high dependence on reinitialising itself from the importance function derived from the S.R.-based tracker makes it highly sensitive to all of the weak points of the S.R.-based tracker discussed in section III.2.1.5. This makes our implementation unsuitable for use in any environment where the S.R.-based tracker’s background subtraction phase is unable to remove persistently distracting background features that it would otherwise mistake for the baton. An example of such an environment would be one where the lighting is not static.


V.1 Future Work 


A significant amount of further work would be needed to turn our system into a fully-fledged consumer product, and much of it involves interesting areas of further research. The following is a list of some of these extensions:


• First and foremost, a more accurate motion model needs to be implemented so as to reduce the dependence of the C.A.-based tracker on the S.R.-based tracker. Doing so should improve its ability to track the baton in a cluttered environment.


• The S.R.-based tracker’s output could be improved, as mentioned before, by joining collinear segments that are close to each other. This may allow the conductor to stand further away from the camera without degrading the accuracy of the tracker’s output, allowing him to move more freely.


• The simple beat prediction methods we used are unable to cope with sudden changes of tempo. One way to model such changes would involve using a machine learning algorithm (such as the one suggested in [CR1]) to learn how the conductor tends to pull the tempo about by rehearsing with him. Such an algorithm would be able to automatically calculate the error in its predictions by comparing its predicted times to the times indicated by the beat detector. A suitable learning algorithm could then be applied to improve the system’s predictive accuracy based on these errors.


• The quality of the MIDI sounds generated by the sound card would not be very satisfying for a musician to work with. A much better approach to this important aesthetic aspect of the project is that described in [JBWSMM1], where real recordings of an orchestra playing are retimed using the Fast Fourier Transform. In that project the researchers pre-calculated a set of retimed audio tracks, and then played back the one that was closest to the true tempo of the conductor. An interesting area of research would be an investigation into whether or not this can be done in real time, so that the conductor’s tempo can be matched more accurately.


• The system should ideally be extended to interpret a wider range of performance directions that a conductor may give. This would involve implementing a more sophisticated gesture analysis component, and may require interpreting other aspects of the conductor’s actions besides the motion of his baton.


• To make the system more useful as a practice tool, features could be added that allow it to analyse the conductor’s performance and give him feedback on the quality of his conducting, possibly suggesting areas that he needs to work on.
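

The error-feedback idea in the beat prediction extension listed above can be illustrated with a very small sketch. It is not the algorithm of [CR1], nor a method used in this project: it simply nudges an estimate of the beat interval by a fraction of each prediction error, which amounts to exponential smoothing. The class name, the `learning_rate` parameter and its default value are all assumptions made for the illustration.

```python
class ErrorFeedbackBeatPredictor:
    """Hypothetical predictor that corrects its beat-interval estimate from errors."""

    def __init__(self, initial_interval, learning_rate=0.5):
        self.interval = initial_interval    # current estimate of seconds per beat
        self.learning_rate = learning_rate  # fraction of each error that is absorbed
        self.predicted_next = None          # predicted time of the next beat

    def observe_beat(self, beat_time):
        """Feed a detected beat time; return the prediction error (None on first beat)."""
        error = None
        if self.predicted_next is not None:
            # Error = time reported by the beat detector minus the predicted time.
            error = beat_time - self.predicted_next
            # Move the interval estimate towards the interval that was observed.
            self.interval += self.learning_rate * error
        self.predicted_next = beat_time + self.interval
        return error
```

A predictor of this kind still lags behind a sudden tempo change by a few beats, which is why the extension above proposes learning the conductor's habits in rehearsal instead.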


APPENDIX A 


Proof of Theorem II.3.2.2-1 


Fig. A-1: Graph showing a 2D data set and its central axis, $L(u; \bar{p}, \theta) = \bar{p} + u(\cos\theta, \sin\theta)$, as solved by polar coordinate linear regression.


Given a set P of N 2D data points, the aim of polar linear regression is to find a line of the form $L(u; \bar{p}, \theta) = \bar{p} + u(\cos\theta, \sin\theta)$ that minimises the sum of the squared perpendicular distances between itself and the data points. Let $p_i = (p_{ix}, p_{iy})$ be the $i^{th}$ data point.


The proof of theorem II.3.2.2-1 begins with the result from principal component analysis that the principal component of a data set passes through its mean. From this we can write down:


$$\bar{p} = (\bar{p}_x, \bar{p}_y) = \frac{1}{N}\left(\sum_i p_{ix}, \sum_i p_{iy}\right)$$


We will make use of the following trigonometric identities in the rest of the proof: 


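$\cos^2\theta + \sin^2\theta = 1$ (Eq. A-1)
$\cos 2\theta = \cos^2\theta - \sin^2\theta$ (Eq. A-2)
$\sin 2\theta = 2\cos\theta\sin\theta$ (Eq. A-3)
$\cos\theta = \pm\sqrt{\frac{1 + \cos 2\theta}{2}}$ (Eq. A-4)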


Let: 


$v_i = p_i - \bar{p}$ for each of the $N$ data points.
$L = L(1) - \bar{p} = (\cos\theta, \sin\theta)$


By rearranging $v_i \cdot L$ for arbitrary $i$ and then using Pythagoras’ theorem and eq. A-1, the squared perpendicular distance from $p_i$ to $L$, $\epsilon_i^2$, is:


$$\epsilon_i^2 = v_i \cdot v_i - (v_i \cdot L)^2 = v_i \cdot v_i - \left(v_{ix}^2\cos^2\theta + 2 v_{ix} v_{iy}\cos\theta\sin\theta + v_{iy}^2\sin^2\theta\right)$$ (Eq. A-5)


And so we can define the sum of the squared perpendicular errors, $R(\theta)$, as:

$$R(\theta) = \sum_i \epsilon_i^2$$ (Eq. A-6)


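To find the $(\cos\theta, \sin\theta)$ that minimises $R(\theta)$, we equate the derivative to 0 and find the minimum:


$$\frac{d(R(\theta))}{d\theta} = -2\sum_i \left[\sin\theta\cos\theta\,(v_{iy}^2 - v_{ix}^2) + v_{ix}v_{iy}(\cos^2\theta - \sin^2\theta)\right] = 0 \text{ at turning points}$$ (Eq. A-7)


$$\Rightarrow (\cos^2\theta - \sin^2\theta)\sum_i v_{ix}v_{iy} + \sin\theta\cos\theta\sum_i (v_{iy}^2 - v_{ix}^2) = 0$$
and
$$\Rightarrow (\sin^2\theta - \cos^2\theta)\sum_i v_{ix}v_{iy} - \sin\theta\cos\theta\sum_i (v_{iy}^2 - v_{ix}^2) = 0$$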


Hence if $(\cos\theta, \sin\theta) = (\cos\phi, \sin\phi)$ is a solution, $(\cos\theta, \sin\theta) = (-\sin\phi, \cos\phi)$ is also a solution.


Let: 
$A = \sum_i (v_{iy}^2 - v_{ix}^2)$
$B = 2\sum_i v_{ix}v_{iy}$
Using eqs. A-2 and A-3 we can rewrite eq. A-7 as:
$$0 = A\sin 2\theta + B\cos 2\theta$$ (Eq. A-8)


Which implies that the vectors $(B, A, 0)$ and $(\cos 2\theta, \sin 2\theta, 0)$ are perpendicular. Hence their cross product is proportional to a unique vector given by:


$$(0, 0, A\cos 2\theta - B\sin 2\theta) = \left(0, 0, \pm\|(\cos 2\theta, \sin 2\theta, 0)\|\,\|(B, A, 0)\|\sin\frac{\pi}{2}\right)$$
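

$$\Rightarrow A\cos 2\theta - B\sin 2\theta = \pm\sqrt{B^2 + A^2}, \text{ by eq. A-1}$$

$$\Rightarrow \sin 2\theta = \frac{A\cos 2\theta \mp \sqrt{B^2 + A^2}}{B}$$

$$\Rightarrow B\cos 2\theta + A\,\frac{A\cos 2\theta \mp \sqrt{B^2 + A^2}}{B} = 0, \text{ by eq. A-8}$$

$$\Rightarrow \cos 2\theta\left(B + \frac{A^2}{B}\right) = \pm\frac{A\sqrt{B^2 + A^2}}{B}$$

$$\Rightarrow |\cos 2\theta| = \frac{|A|}{\sqrt{B^2 + A^2}}$$

$$\Rightarrow |\cos\theta| = \sqrt{\frac{|\cos 2\theta| + 1}{2}}, \text{ by eq. A-4}$$

$$\Rightarrow |\sin\theta| = \sqrt{1 - |\cos\theta|^2}, \text{ by eq. A-1}$$


Now we simply need to determine the signs of $\cos\theta$ and $\sin\theta$ and check whether we have a maximum point or a minimum point. As the derivative has two solutions, we need to check $(\pm|\cos\theta|, |\sin\theta|)$ and $(\pm|\sin\theta|, |\cos\theta|)$.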


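Let $(\cos\Theta, \sin\Theta)$ be the directional vector that minimises $R(\theta)$.


$$(\cos\Theta, \sin\Theta) = \underset{\omega \in \Omega}{\arg\min}\; R(\omega)$$


Where:
$\Omega \subset \mathbb{R} \times \mathbb{R} = \{(c, s), (-c, s), (s, c), (-s, c)\}$
$c = |\cos\theta|$
$s = |\sin\theta|$.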


QED. 
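

The result of the proof translates fairly directly into code. The following Python sketch follows the steps above: it centres the points on their mean, computes A and B, derives c and s, and then picks the candidate in Ω with the smallest summed squared perpendicular error. The function names are illustrative and are not taken from the project's source. Evaluating R at each candidate, rather than testing the second derivative, and the fallback direction used when A = B = 0 are assumptions of this sketch.

```python
import math


def sum_squared_perp_error(vs, direction):
    """R for a unit direction: sum of v.v - (v.L)^2 over centred points (eqs. A-5, A-6)."""
    lx, ly = direction
    return sum(vx * vx + vy * vy - (vx * lx + vy * ly) ** 2 for vx, vy in vs)


def polar_linear_regression(points):
    """Return (mean, direction) of the line minimising squared perpendicular error."""
    n = len(points)
    mean = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    vs = [(x - mean[0], y - mean[1]) for x, y in points]

    a = sum(vy * vy - vx * vx for vx, vy in vs)  # A in eq. A-8
    b = 2.0 * sum(vx * vy for vx, vy in vs)      # B in eq. A-8
    denom = math.hypot(a, b)
    if denom == 0.0:
        # Assumption: with A = B = 0 every direction fits equally well.
        return mean, (1.0, 0.0)

    abs_cos_2theta = abs(a) / denom
    c = math.sqrt((abs_cos_2theta + 1.0) / 2.0)  # |cos(theta)|, by eq. A-4
    s = math.sqrt(max(0.0, 1.0 - c * c))         # |sin(theta)|, by eq. A-1

    # The four sign and swap candidates that make up the set Omega.
    candidates = [(c, s), (-c, s), (s, c), (-s, c)]
    best = min(candidates, key=lambda d: sum_squared_perp_error(vs, d))
    return mean, best
```

For example, `polar_linear_regression([(0, 0), (1, 1), (2, 2)])` yields the mean (1, 1) and a direction close to (0.7071, 0.7071), while points lying on a vertical line select the swapped candidate (s, c).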